
GANADIPATHY TULSI’S JAIN ENGINEERING COLLEGE

BA5021 DATA MINING FOR BUSINESS INTELLIGENCE

UNIT I

INTRODUCTION

Syllabus:
Data mining, Text mining, Web mining, Spatial mining, Process mining, BI
process: Private and Public intelligence, Strategic assessment of
implementing BI

1.1. DATA MINING

Why Data Mining?

 The explosive growth of data: from terabytes to petabytes

 Data collection and data availability

  Automated data collection tools, database systems, the Web, a computerized society

 Major sources of abundant data

  Business: Web, e-commerce, transactions, stocks, …

  Science: remote sensing, bioinformatics, scientific simulation, …

  Society and everyone: news, digital cameras, YouTube, …

 We are drowning in data, but starving for knowledge!

 "Necessity is the mother of invention": data mining is the automated analysis of massive data sets

What is Data Mining?

 Data mining (knowledge discovery from data)

  Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data


 Data mining: a misnomer?

 Alternative names

  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

 Watch out: is everything "data mining"?

  Simple search and query processing

  (Deductive) expert systems

Knowledge Discovery in Databases (KDD) Process

 This is a view from the typical database systems and data warehousing communities

 Data mining plays an essential role in the knowledge discovery process


Knowledge discovery as a process involves the following steps:

1. Data cleaning: to remove noise and inconsistent data

2. Data integration: where multiple data sources may be combined

3. Data selection: where data relevant to the analysis task are retrieved from the database

4. Data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations

5. Data mining: an essential process where intelligent methods are applied in order to extract data patterns

6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures

7. Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user
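The seven steps above can be sketched end-to-end on toy records. Everything in this snippet (the source data, attribute names, and the trivial "pattern") is invented purely for illustration; it is a sketch of the flow, not a real mining pipeline:

```python
# Two hypothetical raw sources; record 2 is deliberately noisy.
raw_source_a = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},  # missing value
]
raw_source_b = [
    {"id": 3, "age": 45, "income": 48000},
]

# 1. Data cleaning: drop records with missing values.
cleaned = [r for r in raw_source_a if None not in r.values()]

# 2. Data integration: combine the multiple sources.
integrated = cleaned + raw_source_b

# 3. Data selection: keep only the attributes relevant to the task.
selected = [{"age": r["age"], "income": r["income"]} for r in integrated]

# 4. Data transformation: consolidate via a summary (mean income).
mean_income = sum(r["income"] for r in selected) / len(selected)

# 5. Data mining: extract a (trivial) pattern -- ages of above-mean earners.
pattern = [r["age"] for r in selected if r["income"] >= mean_income]

# 6. Pattern evaluation: keep the pattern only if it is non-empty.
interesting = len(pattern) > 0

# 7. Knowledge presentation: render the finding for the user.
report = f"{len(pattern)} customer(s) at or above mean income {mean_income:.0f}"
print(report)
```

Each step feeds the next, mirroring the numbered list: the mined "pattern" here is deliberately simple so the data flow stays visible.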


KDD Process: A Typical View from ML and Statistics

Data Mining in Business Intelligence

4 BA5021 DATAMINING FOR BUSINESS INTELLIGENCE


GANADIPATHY TULSI’S JAIN ENGINEERING COLLEGE

Architecture of typical DM Systems

Based on the KDD view, the architecture of a typical data mining system may have the following major components:

 Database, data warehouse, or other information repository:

  This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.

  Data cleaning and data integration techniques may be performed on the data.

 Database or data warehouse server:

  The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

 Knowledge base:

  This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.


  Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.

  Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata.

 Data mining engine:

  This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

 Pattern evaluation module:

  This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.

  It may use interestingness thresholds to filter out discovered patterns.

 User interface:

  This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results.

Data Mining: On What Kinds of Data?

There are a number of data stores on which data mining can be performed:

 Relational database

 Data warehouse

 Transactional database


 Advanced databases and information repositories

  Spatial and temporal data

  Time-series data

  Stream data

  Multimedia databases

  Text databases and the WWW

 Relational database

  A relational database is a collection of tables, each of which is assigned a unique name.
  Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
  Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
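A relational table with a unique key, attributes, and tuples can be illustrated with Python's built-in sqlite3 module. The customer table, its attributes, and its rows below are hypothetical examples, not from any real schema:

```python
import sqlite3

# In-memory relational database with one named table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Attributes (columns): cust_id is the unique key identifying each tuple.
cur.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
cur.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Alice", "Chicago"), (2, "Bob", "Toronto"), (3, "Carol", "Chicago")],
)

# A relational query retrieves the tuples matching a condition on an attribute.
rows = cur.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Chicago",)
).fetchall()
print(rows)
conn.close()
```

The query returns the name attribute of every tuple whose city attribute is Chicago, which is exactly the "set of tuples described by attribute values" view of a relational table.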
 Data warehouse
 A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
that usually resides at a single site.
 Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data loading,
and periodic data refreshing.

Figure: Data Warehouse


 To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity.

 A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema.

 Each cell stores the value of some aggregate measure, such as count or sales amount.

 The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube.

 A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.

 A data cube for summarized sales data of AllElectronics is presented in the figure.

 The cube has three dimensions:

  address (with city values Chicago, New York, Toronto, Vancouver),

  time (with quarter values Q1, Q2, Q3, Q4), and

  item (with item type values home entertainment, computer, phone, security).

 The aggregate value stored in each cell of the cube is sales amount (in thousands).
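The cube just described can be sketched as a Python dictionary keyed by (address, time, item), with the sales amount in each cell. The sales figures below are invented, and roll_up is a toy stand-in for the OLAP roll-up operation that aggregates away one dimension:

```python
# Toy data cube: (address, time, item) -> sales amount (in thousands).
# All values are made up for illustration.
cube = {
    ("Chicago",  "Q1", "computer"): 882,
    ("Chicago",  "Q2", "computer"): 890,
    ("New York", "Q1", "computer"): 968,
    ("New York", "Q1", "phone"):    746,
}

def roll_up(cube, dim_index):
    """Sum the cube over one dimension, as an OLAP roll-up does."""
    out = {}
    for key, value in cube.items():
        # Drop the rolled-up dimension from the cell's coordinates.
        reduced = key[:dim_index] + key[dim_index + 1:]
        out[reduced] = out.get(reduced, 0) + value
    return out

# Roll up over time (dimension 1): total sales per (address, item).
by_city_item = roll_up(cube, 1)
print(by_city_item[("Chicago", "computer")])
```

Precomputing such rolled-up views is what lets a warehouse answer summary queries quickly instead of re-scanning detail records.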

 A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide.

 A data mart, on the other hand, is a departmental subset of a data warehouse. It focuses on selected subjects, and thus its scope is department-wide.


 By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for on-line analytical processing, or OLAP.


 Transactional database

  In general, a transactional database consists of a file where each record represents a transaction.

  A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store).
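A transactional database of this shape can be sketched as a list of records, each with a trans ID and an item list. The transaction IDs and items below are made up for illustration:

```python
# Hypothetical transactional database: one record per transaction.
transactions = [
    {"trans_id": "T100", "items": ["bread", "milk", "diapers"]},
    {"trans_id": "T200", "items": ["beer", "diapers"]},
    {"trans_id": "T300", "items": ["bread", "milk"]},
]

# A simple query over transactional data: which transactions contain milk?
with_milk = [t["trans_id"] for t in transactions if "milk" in t["items"]]
print(with_milk)
```

This item-list layout (rather than one row per attribute) is what makes transactional data the natural input for market-basket analysis later in this unit.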

Advanced databases and information repositories:

 Object-Relational Databases

  Constructed based on an object-relational data model.
  This model extends the relational model by providing a rich data type for handling complex objects and object orientation.
  The object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered an object.

 Temporal Databases, Sequence Databases, and Time-Series Databases

 Temporal Databases

  A temporal database typically stores relational data that include time-related attributes.


  These attributes may involve several timestamps, each having different semantics.

 Sequence Databases

  A sequence database stores sequences of ordered events, with or without a concrete notion of time.

  Examples include customer shopping sequences, Web click streams, and biological sequences.

 Time-Series Databases
 A time-series database stores sequences of values or events
obtained over repeated measurements of time (e.g., hourly,
daily, weekly).
 Examples include data collected from the stock exchange,
inventory control, and the observation of natural phenomena
(like temperature and wind).

 Spatial Databases

  Spatial databases contain spatial-related information.
  Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases.

 Text Databases and Multimedia Databases

  Text databases are databases that contain word descriptions for objects.
 These word descriptions are usually not simple keywords but
rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages,
summary reports, notes, or other documents.
 Multimedia databases store image, audio, and video data. They
are used in applications such as picture content-based retrieval,
voice-mail systems, video-on-demand systems, the World Wide
Web, and speech-based user interfaces that recognize spoken
commands


 Data Streams

  Data flow in and out of an observation platform (or window) dynamically.
 Such data streams have the following unique features: huge
or possibly infinite volume, dynamically changing, flowing in
and out in a fixed order, allowing only one or a small number
of scans, and demanding fast (often real-time) response
time.
  Examples: scientific and engineering data, time-series data, and data produced in other dynamic environments.

DATA MINING CONCEPTS AND APPLICATIONS: Data Mining Definitions, Characteristics, and Benefits

 Data mining is a term used to describe discovering or "mining" knowledge from large amounts of data.

 Technically speaking, data mining is a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge (or patterns) from large sets of data.

 These patterns can be in the form of business rules, affinities, correlations, trends, or prediction models.

 Most literature defines data mining as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases," where the data are organized in records structured by categorical, ordinal, and continuous variables.

The meanings of the key terms are as follows:

 Nontrivial means that some experimentation-type search or inference is involved.

 Valid means that the discovered patterns should hold true on new data with a sufficient degree of certainty.


 Novel means that the patterns are not previously known.

 Potentially useful means that the discovered patterns should lead to some benefit to the user or task.

 Ultimately understandable means that the pattern should make business sense that leads to the user saying "mmm!"

 Data mining is not a new discipline, but rather a new definition for the use of many disciplines.

 Data mining is tightly positioned at the intersection of many disciplines, including statistics, artificial intelligence, machine learning, management science, information systems, and databases.


Major characteristics and objectives of data mining:

 Data are often buried deep within very large databases.

 The data are cleansed and consolidated into a data warehouse.

 Data may be presented in a variety of formats.

 The data mining environment is usually a client/server or Web-based information systems architecture.

 Sophisticated new tools, including advanced visualization tools, help to extract the information buried in corporate files or archival public records.

 Data miners are exploring the usefulness of soft data.

 The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly.

 Data mining tools are readily combined with spreadsheets and other software development tools.

 It is sometimes necessary to use parallel processing for data mining.

A Simple Taxonomy of Data

 Data refers to a collection of facts usually obtained as the result of experiences, observations, or experiments.

 Data may consist of numbers, letters, words, images, voice recordings, and so on, as measurements of a set of variables.

 Data are often viewed as the lowest level of abstraction, from which information and then knowledge are derived.

 At the highest level of abstraction, one can classify data as structured and unstructured.

 Structured data is what data mining algorithms use, and can be classified as categorical or numeric.

 Categorical data can be subdivided into nominal or ordinal data, whereas numeric data can be subdivided into interval or ratio data.

 Categorical data

  represent the labels of multiple classes used to divide a variable into specific groups.

  Examples of categorical variables include sex, age group, and educational level.

 Nominal data

  contain measurements of simple codes assigned to objects as labels, which are not measurements.

  For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced.

  Nominal data can be represented with binomial values having two possible values (e.g., yes/no, true/false, good/bad), or multinomial values having three or more possible values.

 Ordinal data

  contain codes assigned to objects or events as labels that also represent the rank order among them.

  For example, the variable credit score can be generally categorized as (1) low, (2) medium, or (3) high.


  Similar ordered relationships can be seen in variables such as age group (i.e., child, young, middle-aged, elderly).

 Numeric data

  represent the numeric values of specific variables.

  Examples of numerically valued variables include age, number of children, and total household income.

 Interval data

  are variables that can be measured on interval scales.

  A common example of interval-scale measurement is temperature on the Celsius scale.

 Ratio data

  include measurement variables commonly found in the physical sciences and engineering, such as mass, length, time, plane angle, energy, and electric charge.
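The taxonomy above can be made concrete with a small mapping from example variables to their measurement scales. The variable names and the helper functions are illustrative only, not a fixed standard:

```python
# Illustrative variables mapped to the measurement scales defined above.
scales = {
    "marital_status": "nominal",      # labels without any order
    "credit_score_band": "ordinal",   # ranked labels: low < medium < high
    "temperature_c": "interval",      # meaningful differences, arbitrary zero
    "household_income": "ratio",      # true zero, so ratios are meaningful
}

def is_categorical(variable):
    """Nominal and ordinal scales are the categorical branch."""
    return scales[variable] in ("nominal", "ordinal")

def is_numeric(variable):
    """Interval and ratio scales are the numeric branch."""
    return scales[variable] in ("interval", "ratio")

print(is_categorical("marital_status"), is_numeric("temperature_c"))
```

Knowing a variable's scale matters downstream: for example, averaging nominal codes like (1) single and (3) divorced is meaningless, while averaging ratio data like income is not.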

How Data Mining Works

 Using existing and relevant data, data mining builds models to identify patterns among the attributes presented in the data set.

 Models are the mathematical representations that identify the patterns among the attributes of the objects described in the data set.

 Some of these patterns are explanatory (explaining the interrelationships and affinities among the attributes), whereas others are predictive (foretelling future values of certain attributes).

In general, data mining seeks to identify four major types of patterns:

1. Associations

find the commonly co-occurring groupings of things, such as beer and diapers going together in market-basket analysis.


2. Predictions

tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of a game or forecasting the temperature of a particular day.

3. Clusters

identify natural groupings of things based on their known characteristics, such as assigning customers to different segments based on their demographics and past purchase behaviors.

4. Sequential relationships

discover time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an investment account within a year.

 Data mining tasks can be classified into three main categories:

 prediction,

 association, and

 clustering.

 Based on the way in which the patterns are extracted from the
historical data, the learning algorithms of data mining methods
can be classified as either

 supervised or

 unsupervised.

 Supervised learning algorithms: the training data include both the descriptive attributes (i.e., independent variables or decision variables) and the class attribute (i.e., output variable or result variable).

 Unsupervised learning algorithms: the training data include only the descriptive attributes.


PREDICTION

 Prediction is commonly referred to as the act of telling about the future.

 It differs from simple guessing by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling.

 A term that is commonly associated with prediction is forecasting.

 Prediction is largely experience and opinion based, whereas forecasting is data and model based.

 That is, in order of increasing reliability, one might list the relevant terms as guessing, predicting, and forecasting, respectively.


CLASSIFICATION (supervised induction)

 The objective of classification is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior.

 This induced model consists of generalizations over the records of a training dataset, which help distinguish predefined classes.

 The hope is that the model can then be used to predict the classes of other unclassified records and, more important, to accurately predict actual future events.

 Common classification tools include neural networks, decision trees, logistic regression, and discriminant analysis.

 Emerging tools include rough sets, support vector machines, and genetic algorithms.

 Neural networks

  Involve the development of mathematical structures (somewhat resembling the biological neural networks in the human brain) that have the capability to learn from past experiences presented in the form of well-structured datasets.

 Decision trees

  Classify data into a finite number of classes based on the values of the input variables.
  Decision trees are essentially a hierarchy of if-then statements.
  They are faster than neural networks.
  They are most appropriate for categorical and interval data.
  Therefore, incorporating continuous variables into a decision tree framework requires discretization: converting continuous-valued numerical variables to ranges and categories.
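A decision tree's "hierarchy of if-then statements," including the discretization step for a continuous variable, can be written out by hand. The credit-approval task, the income thresholds, and the rules below are all invented for illustration; a real tree would be induced from training data:

```python
def discretize_income(income):
    """Discretization: map a continuous income value to a category."""
    if income < 30000:
        return "low"
    elif income < 70000:
        return "medium"
    return "high"

def classify(applicant):
    """A hand-written decision tree as nested if-then rules."""
    band = discretize_income(applicant["income"])
    if band == "low":
        return "reject"
    if band == "medium":
        # Second split on a categorical attribute.
        return "approve" if applicant["has_collateral"] else "reject"
    return "approve"

print(classify({"income": 55000, "has_collateral": True}))
```

Each path from the root test (income band) to a leaf ("approve"/"reject") is one rule, which is why trees are easy to read compared with neural networks.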


CLUSTERING

 Clustering partitions a collection of things (e.g., objects and events presented in a structured dataset) into segments (or natural groupings) whose members share similar characteristics.

 Unlike in classification, in clustering the class labels are unknown.

 As the selected algorithm goes through the dataset, identifying the commonalities of things based on their characteristics, the clusters are established.

 Because the clusters are determined using a heuristic-type algorithm, different algorithms may end up with different sets of clusters for the same dataset.

 It may be necessary for an expert to interpret, and potentially modify, the suggested clusters before the results of clustering techniques are put to actual use.

 After reasonable clusters have been identified, they can be used to classify and interpret new data.

 The goal of clustering is to create groups so that the members within each group have maximum similarity and the members across groups have minimum similarity.

 The most commonly used clustering techniques include k-means (from statistics) and self-organizing maps (from machine learning), the latter a unique neural network architecture developed by Kohonen.
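A minimal one-dimensional k-means sketch shows the heuristic loop that the text describes: values join their nearest center, then each center moves to its cluster's mean. The ages, starting centers, and k=2 are assumptions made for illustration:

```python
def kmeans_1d(values, centers, iterations=10):
    """Toy 1-D k-means: alternate assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center's cluster.
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        # Update step: each center moves to its cluster's mean
        # (an empty cluster keeps its old center).
        centers = [sum(vs) / len(vs) if vs else c for c, vs in clusters.items()]
    return sorted(centers)

# Hypothetical customer ages with two apparent groupings.
ages = [22, 25, 27, 61, 64, 68]
final_centers = kmeans_1d(ages, centers=[20, 70])
print(final_centers)  # roughly one center in the 20s, one in the 60s
```

Note the point made in the text: a different choice of starting centers (or a different algorithm) could yield different clusters for the same data, which is why an expert review of the result is advisable.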

ASSOCIATIONS

 Associations, or association rule learning in data mining, is a popular and well-researched technique for discovering interesting relationships among variables in large databases.

 In the context of the retail industry, association rule mining is often called market-basket analysis.


 Two commonly used derivatives of association rule mining are link analysis and sequence mining.

  Link analysis: the linkage among many objects of interest is discovered automatically, such as the links between Web pages and referential relationships among groups of academic publication authors.

  Sequence mining: relationships are examined in terms of their order of occurrence to identify associations over time.
HYPOTHESIS- OR DISCOVERY-DRIVEN DATA MINING

 Data mining can be hypothesis driven or discovery driven.

 Hypothesis-driven data mining begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition.

 For example, a marketing manager may begin with the proposition: "Are DVD player sales related to sales of television sets?"

 Discovery-driven data mining finds patterns, associations, and other relationships hidden within datasets. It can uncover facts that an organization had not previously known or even contemplated.

DATA MINING APPLICATIONS

• Customer relationship management.

• Banking.

• Retailing and logistics.

• Manufacturing and production.

• Brokerage and securities trading.

• Insurance.

• Computer hardware and software.

• Government and defense.

• Travel industry (airlines, hotels/resorts, rental car companies).



• Health care.

• Medicine.

• Entertainment industry.

• Homeland security and law enforcement.

• Sports.

Data mining has become a popular tool for addressing many complex business issues.

1. Customer relationship management.

 Customer relationship management (CRM) is the new and emerging extension of traditional marketing.

 The goal of CRM is to create one-on-one relationships with customers by developing an intimate understanding of their needs and wants.

 As businesses build relationships with their customers over time through a variety of transactions (e.g., product inquiries, sales, service requests, warranty calls), they accumulate detailed data about those interactions.

 When combined with demographic and socioeconomic attributes, this information-rich data can be used to

(1) identify the most likely responders to/buyers of new products/services (i.e., customer profiling);

(2) understand the root causes of customer attrition in order to improve customer retention (i.e., churn analysis);

(3) discover time-variant associations between products and services to maximize sales and customer value; and

(4) identify the most profitable customers and their preferential needs to strengthen relationships and to maximize sales.


2. Banking.

 Data mining can help banks with the following:

(1) automating the loan application process by accurately predicting the most probable defaulters;

(2) detecting fraudulent credit card and online-banking transactions;

(3) identifying ways to maximize customer value by selling customers the products and services they are most likely to buy; and

(4) optimizing the cash return by accurately forecasting the cash flow on banking entities (e.g., ATMs, banking branches).

3. Retailing and logistics.

 In the retailing industry, data mining can be used to

(1) predict accurate sales volumes at specific retail locations in order to determine correct inventory levels;

(2) identify sales relationships between different products (with market-basket analysis) to improve the store layout and optimize sales promotions;

(3) forecast consumption levels of different product types (based on seasonal and environmental conditions) to optimize logistics and hence maximize sales; and

(4) discover interesting patterns in the movement of products (especially for products that have a limited shelf life because they are prone to expiration, perishability, and contamination) in a supply chain by analyzing sensory and RFID data.

4. Manufacturing and production.

 Manufacturers can use data mining to

(1) predict machinery failures before they occur through the use of sensory data (enabling what is called condition-based maintenance);


(2) identify anomalies and commonalities in production systems to optimize manufacturing capacity; and

(3) discover novel patterns to identify and improve product quality.

5. Brokerage and securities trading.

 Brokers and traders use data mining to

(1) predict when and how much certain bond prices will change;

(2) forecast the range and direction of stock fluctuations;

(3) assess the effect of particular issues and events on overall market movements; and

(4) identify and prevent fraudulent activities in securities trading.

6. Insurance.

 The insurance industry uses data mining techniques to

(1) forecast claim amounts for property and medical coverage costs for better business planning;

(2) determine optimal rate plans based on the analysis of claims and customer data;

(3) predict which customers are more likely to buy new policies with special features; and

(4) identify and prevent incorrect claim payments and fraudulent activities.

7. Computer hardware and software.

 Data mining can be used to

(1) predict disk drive failures well before they actually occur;

(2) identify and filter unwanted Web content and e-mail messages;

(3) detect and prevent computer network security breaches; and

(4) identify potentially unsecure software products.


8. Government and defense.

 Data mining also has a number of military applications. It can be used to

(1) forecast the cost of moving military personnel and equipment;

(2) predict an adversary's moves to develop more successful strategies for military engagements;

(3) predict resource consumption for better planning and budgeting; and

(4) identify classes of unique experiences, strategies, and lessons learned from military operations for better knowledge sharing.

9. Travel industry (airlines, hotels / resorts, rental car companies).

 Data mining has a variety of uses in the travel industry. It is successfully used to

(1) predict sales of different services (seat types in airplanes, room types in hotels/resorts, car types in rental car companies) in order to optimally price services to maximize revenues as a function of time-varying transactions (commonly referred to as yield management);

(2) forecast demand at different locations to better allocate limited organizational resources;

(3) identify the most profitable customers and provide them with personalized services to maintain their repeat business; and

(4) retain valuable employees by identifying and acting on the root causes of attrition.

10. Health care.

 Data mining has a number of health care applications. It can be used to

(1) identify people without health insurance and the factors underlying this undesired phenomenon;


(2) identify novel cost-benefit relationships between different

treatments to develop more effective strategies;

(3) forecast the level and the time of demand at different service
locations to optimally allocate organizational resources; and

(4) understand the underlying reasons for customer and employee attrition.

11. Medicine.

 Data mining can be used in medicine to

(1) identify novel patterns to improve the survivability of patients with cancer;

(2) predict success rates of organ transplantation patients to develop better donor-organ matching policies;

(3) identify the functions of different genes in the human chromosome (known as genomics); and

(4) discover the relationships between symptoms and illnesses to help medical professionals make informed and correct decisions in a timely manner.

12. Entertainment industry.

 Data mining is successfully used by the entertainment industry to

(1) analyze viewer data to decide what programs to show during prime time and how to maximize returns by knowing where to insert advertisements;

(2) predict the financial success of movies before they are produced to make investment decisions and to optimize the returns;

(3) forecast the demand at different locations and different times to better schedule entertainment events and to optimally allocate resources; and

(4) develop optimal pricing policies to maximize revenues.


13. Homeland security and law enforcement.

 Data mining has a number of homeland security and law


enforcement applications. Data mining is often used to

(1) identify patterns of terrorist behaviors

(2) discover crime patterns (e.g., locations, timings, criminal


behaviors, and other related attributes) to help solve criminal
cases in a timely manner;

(3) predict and eliminate potential biological and chemical attacks


to a nation’s critical infrastructure by analyzing special-purpose
sensory data; and

(4) identify and stop malicious attacks on critical information


infrastructures (often called information warfare).

14. Sports.

 Data mining was used to improve the performance of National
Basketball Association (NBA) teams in the United States.

 The NBA developed Advanced Scout, a PC-based data mining
application that coaching staff use to discover interesting patterns
in basketball game data.

 Pattern interpretation is facilitated by allowing the user to
relate patterns to videotape.

 See Bhandari et al. (1997) for details.

DATA MINING PROCESS

 In order to systematically carry out data mining projects, a general


process is usually followed.

 Based on best practices, data mining researchers and practitioners


have proposed several processes (workflows or simple step-by-
step approaches)


 One such standardized process, the most popular one, Cross-


Industry Standard Process for Data Mining—CRISP-DM

 CRISP-DM is a sequence of six steps.

 Starts with a good understanding of the business and ends with


the deployment of the solution that satisfied the specific business
need.

 Even though these steps are sequential in nature, there is usually


a great deal of backtracking.

 Because data mining is driven by experience and
experimentation, the whole process can be very iterative,
depending on the problem situation and the
knowledge/experience of the analyst.


Step 1: Business Understanding

 A thorough understanding of the managerial need for new


knowledge and an explicit specification of the business objective

 Specific questions such as

"What are the common characteristics of the customers we
have lost to our competitors recently?" or

"What are typical profiles of our customers, and how much
value does each of them provide to us?"

need to be addressed.

 Then a project plan for finding such knowledge is developed that


specifies the people responsible for collecting the data, analyzing
the data, and reporting the findings.

 At this early stage, a budget to support the study should also be


established

Step 2: Data Understanding

 Different business tasks require different sets of data.

 The main activity of the data mining process is to identify the


relevant data from many available databases.

 Some key points must be considered in the data identification


and selection phase.

 First and foremost, the analyst should be clear and concise about
the description of the data mining task so that the most relevant
data can be identified.

 For example, a retail data mining project may seek to identify


spending behaviors of female shoppers, who purchase seasonal
clothes, based on their demographics, credit card transactions,
and socioeconomic attributes.

 Furthermore, the analyst should build an intimate understanding of


the data sources


 Example:

 where the relevant data are stored and in what form;

 what the process of collecting the data is—automated versus


manual;

 who the collectors of it are;

 and how often it is updated etc

 In order to better understand the data, the analyst often uses a


variety of statistical and graphical techniques

 Such as simple statistical summaries of each variable (e.g., for


numeric variables, the average, minimum/maximum, median,
standard deviation are among the calculated measures, etc)

 Data sources for data selection can vary.

 Normally, data sources for business applications include


demographic data (such as income, education, number of
households, and age),

sociographic data (such as hobby, club membership, and


entertainment),

transactional data (sales record, credit card spending, and issued


checks), and so on.

 Data can be categorized as quantitative and qualitative.

Step 3: Data Preparation

 The purpose of data preparation (more commonly called data
preprocessing) is to take the data identified in the previous step
and prepare them for analysis by data mining methods.

 Data are generally

Incomplete (lacking attribute values, lacking certain attributes of


interest, or containing only aggregate data),

Noisy (containing errors or outliers), and


Inconsistent (containing discrepancies in codes or names).

 Four main steps are needed to convert the raw, real-world data
into minable datasets:

 Data Collection / Selection

 Data Cleaning

 Data Transformation

 Data Reduction

1. Data Collection / Selection

 The relevant data are collected from the identified


sources

 The necessary records and variables are selected, and

 The records coming from multiple data sources are


integrated


2. Data Cleaning

 The data are cleaned (this step is also known as


data scrubbing).

 Problematic values in the dataset are identified and dealt
with.

 Missing values - need to be imputed (filled with a


most probable value) or ignored;

 Noisy values - (i.e., the outliers) and smooth them


out.

 Inconsistencies - (unusual values within a


variable) in the data should be handled using
domain knowledge and/or expert opinion.
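The cleaning operations above can be sketched in Python. This is an illustrative example, not from the notes: missing values are imputed with the column mean, and noisy values are flagged with a simple z-score rule (the threshold is an assumed choice).

```python
import statistics

def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def flag_outliers(values, z_threshold=3.0):
    """Return indices of values lying more than z_threshold
    standard deviations from the mean (simple noise detection)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > z_threshold]
```

In practice the imputation strategy (mean, median, most frequent value) and the outlier rule depend on domain knowledge, as the notes point out.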

3. Data Transformation

 Data are transformed for better processing.

 For instance, in many cases, the data are


normalized in order to mitigate the potential bias of
one variable (having large numeric values, such as
for household income)

 Another transformation that takes place is


discretization and/or aggregation.

 In some cases, the numeric variables are converted


to categorical values (e.g., low, medium, and high);

 In other cases, a nominal variable’s unique value


range is reduced to a smaller set using concept
hierarchies.

 Even though data miners like to have large


datasets, too much data is also a problem.

 One can visualize the data commonly used in data
mining projects as a flat file consisting of two
dimensions: variables (columns) and cases/records
(rows).

 In some cases (e.g., image processing and genome


projects with complex microarray data), the number
of variables can be rather large, and the analyst
must reduce the number down to a manageable
size.

 The variables are treated as different dimensions


that describe the phenomenon from different
perspectives, in data mining, this process is
commonly called dimensional reduction.
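The transformations described above can be sketched as follows. This is an illustrative example, not from the notes: min-max normalization rescales a variable to [0, 1] to mitigate the bias of large numeric values, and a simple cut-point rule discretizes a numeric value into low/medium/high categories (the cut points are assumed values).

```python
def min_max_normalize(values):
    """Rescale numeric values to [0, 1] so that variables with large
    magnitudes (e.g., household income) do not dominate."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, low_cut, high_cut):
    """Convert a numeric value to a categorical label."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"
```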

Step 4: Model Building

 In this step, various modeling techniques are selected and applied


to an already prepared dataset in order to address the specific
business need.
 Depending on the business need, the data mining task can be of a
prediction (either classification or regression), an association, or a
clustering type.
 Each of these data mining tasks can use a variety of data mining
methods and algorithms.
 Some of these data mining methods and some of the most popular
algorithms - decision trees for classification, k-means for
clustering, and the Apriori algorithm for association rule mining.

Step 5: Testing and Evaluation

 The developed models are assessed and evaluated for their


accuracy and generality.
 This step assesses the degree to which the selected model (or
models) meets the business objectives and, if so, to what extent
 Another option is to test the developed model(s) in a real-world
scenario if time and budget constraints permit.
 Even though the outcome of the developed models are expected
to relate to the original business objectives,

 The testing and evaluation step is a critical and challenging task.


 No value is added by the data mining task until the business value
obtained from discovered knowledge patterns is identified and
recognized.

 Determining the business value from discovered knowledge


patterns is somewhat similar to playing with puzzles.

 The success of this identification operation depends on the


interaction among data analysts, business analysts, and
decision makers

Step 6: Deployment

 Development and assessment of the models is not the end of the


data mining project.

 The knowledge gained from such exploration will need to be


organized and presented in a way that the end user can
understand and benefit from it.

 Depending on the requirements, the deployment phase can be as


simple as generating a report or as complex as implementing a
repeatable data mining process across the enterprise.

 In many cases, it is the customer, not the data analyst, who carries
out the deployment steps.

 The deployment step may also include maintenance activities for


the deployed models


Other Data Mining Standardized Processes and Methodologies

Ranking of Data Mining Processes and Methodologies


DATA MINING METHODS

 A variety of methods are available for performing data mining


studies, including classification, regression, clustering, and
association.

 Most data mining software tools employ more than one technique
(or algorithm) for each of these methods.

1. Classification

 Classification is perhaps the most frequently used data mining


method

 A popular member of the machine-learning family of techniques,

 Classification learns patterns from past data (a set of information—


traits, variables, features—on characteristics of the previously
labeled items, objects, or events) in order to place new instances
(with unknown labels) into their respective groups or classes.

 For example, one could use classification to predict whether the


weather on a particular day will be "sunny," "rainy," or "cloudy."

 Popular classification tasks include credit approval (i.e., good or


bad credit risk),

 store location (e.g., good, moderate, bad), target marketing (e.g.,


likely customer, no hope),

 fraud detection (i.e., yes, no), and telecommunication (e.g., likely


to turn to another phone company, yes/no).

 If what is being predicted is a class label (e.g., "sunny," "rainy," or
"cloudy"), the prediction problem is called a classification.

 whereas if it is a numeric value (e.g., temperature such as 68°F),


the prediction problem is called a regression.

 The most common two-step methodology of classification-type


prediction involves


 model development/training and

 model testing/deployment.

 In the model development phase, a collection of input data,


including the actual class labels, is used.
 After the model has been trained, it is tested against a holdout
sample for accuracy assessment; the validated model can then be
used to predict the classes of new data instances (where the class
label is unknown)

Several factors are considered in assessing the model, including the


following:

1. Predictive accuracy.

 The model’s ability to correctly predict the class label of new or


previously unseen data.

 Prediction accuracy is the most commonly used assessment


factor

 To compute this measure, actual class labels of a test dataset


are matched against the class labels predicted by the model.

 The accuracy can then be computed as the accuracy rate,


which is the percentage of test dataset samples correctly
classified by the model

2. Speed.
 The computational costs involved in generating and using the
model, where faster is deemed to be better.
3. Robustness.

 The model’s ability to make reasonably accurate predictions,


given noisy data or data with missing and erroneous values.

4. Scalability.

 The ability to construct a prediction model efficiently given a


rather large amount of data.


5. Interpretability.
 The level of understanding and insight provided by the model
(e.g., how and/or what the model concludes on certain
predictions).

Estimating the True Accuracy of Classification Models

 In classification problems, the primary source for accuracy


estimation is the confusion matrix (also called a classification
matrix or a contingency table).

 The numbers along the diagonal (from upper left to lower right)
represent correct decisions, and the numbers outside this diagonal
represent the errors.
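The accuracy rate can be computed directly from a confusion matrix. A minimal sketch, using a hypothetical 2x2 matrix of counts:

```python
def accuracy_rate(confusion):
    """Accuracy rate = sum of the diagonal (correct decisions)
    divided by the total number of predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical counts: rows = actual class, columns = predicted class
cm = [[86, 14],
      [9, 91]]
```

Here the model made 86 + 91 = 177 correct decisions out of 200 predictions, for an accuracy rate of 0.885.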


SIMPLE SPLIT

 The simple split partitions the data into two mutually exclusive
subsets called a training set and a test set (or holdout set).

 Two-thirds of the data - training set; remaining one-third - test set.

 Training set - used by the inducer (model builder), and the built
classifier is then tested on the test set
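The simple split can be sketched as follows; the two-thirds/one-third proportion follows the notes, while the seed and the use of random shuffling are illustrative choices.

```python
import random

def simple_split(records, train_fraction=2/3, seed=7):
    """Randomly partition records into two mutually exclusive
    subsets: a training set and a test (holdout) set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```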

K-FOLD CROSS-VALIDATION (rotation estimation)

 The complete dataset is randomly split into k mutually exclusive


subsets of approximately equal size.

 The classification model is trained and tested k times.


 Each time, it is trained on all but one fold and then tested on the
remaining single fold.
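A minimal sketch of generating k-fold splits as index lists. The round-robin fold assignment used here is an illustrative choice; in practice the data are usually shuffled first, as the notes' "randomly split" wording implies.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs: the data indices are
    divided into k mutually exclusive folds, and each fold serves as
    the test set exactly once."""
    folds = [list(range(fold, n, k)) for fold in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
        yield train_idx, test_idx
```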

ADDITIONAL CLASSIFICATION ASSESSMENT METHODOLOGIES

1. Leave-one-out.

 similar to the k-fold cross-validation

 every data point is used for testing once

 This is a time consuming methodology, but for small datasets,


sometimes it is a viable option.

2. Bootstrapping.

 With bootstrapping, a fixed number of instances from the


original data are sampled (with replacement) for training and the
rest of the dataset is used for testing.

 This process is repeated as many times as desired.

3. Jackknifing.

 Similar to the leave-one-out methodology;

 The accuracy is calculated by leaving one sample out at each


iteration of the estimation process.

4. Area under the ROC curve.

 The area under the ROC curve is a graphical assessment


technique

 The true positive rate is plotted on the Y-axis and the false
positive rate is plotted on the X-axis.

 The area under the ROC curve determines the accuracy


measure of a classifier: A value of 1 indicates a perfect
classifier whereas 0.5 indicates no better than random chance;
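The area under the ROC curve can also be computed without plotting, via the equivalent rank-based (Mann-Whitney) identity: it equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. A minimal sketch:

```python
def auc_roc(labels, scores):
    """AUC via the Mann-Whitney identity: count, over all
    positive/negative pairs, how often the positive is scored
    higher (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0, and an uninformative one yields 0.5, matching the interpretation in the notes.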


CLASSIFICATION TECHNIQUES

 Decision tree analysis.

 Statistical analysis.

 Neural networks.

 Case-based reasoning

 Bayesian classifiers

 Genetic algorithms

 Rough sets

2. Cluster Analysis

 Data mining method for classifying items, events, or concepts into


common groupings called clusters.

 The method is commonly used in biology, medicine, genetics,


social network analysis, anthropology, archaeology, astronomy,
character recognition, and even in management information
system development.

 Cluster analysis is an exploratory data analysis tool for solving


classification problems.

 The objective is to sort cases (e.g., people, things, events) into


groups, or clusters, so that the degree of association is strong
among members of the same cluster and weak among members
of different clusters.

 Each cluster describes the class to which its members belong.

Cluster analysis results may be used to:

 Identify a classification scheme (e.g., types of customers)


 Suggest statistical models to describe populations
 Indicate rules for assigning new cases to classes for identification,
targeting, and diagnostic purposes
 Provide measures of definition, size, and change in what were
previously broad concepts


 Find typical cases to label and represent classes


 Decrease the size and complexity of the problem space for other
data mining methods
 Identify outliers in a specific domain (e.g., rare-event detection)

DETERMINING THE OPTIMAL NUMBER OF CLUSTERS

 Clustering algorithms usually require one to specify the number of


clusters to find.

 If this number is not known from prior knowledge, it should be


chosen in some way

The following are among the most commonly referenced ones:

 Look at the percentage of variance explained as a function of the


number of clusters; that is, choose a number of clusters so that
adding another cluster would not give much better modeling of the
data.

 Set the number of clusters to k ≈ (n/2)^(1/2), where n is the
number of data points.

 Use the Akaike Information Criterion, which is a measure of the


goodness of fit

 Use Bayesian Information Criterion, which is a model-selection


criterion to determine the number of clusters
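The square-root rule of thumb above can be expressed in one line; a minimal sketch (rounding to the nearest integer is an assumed convention):

```python
import math

def rule_of_thumb_k(n):
    """Heuristic from the notes: choose k approximately equal to
    the square root of n/2, where n is the number of data points."""
    return round(math.sqrt(n / 2))
```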

ANALYSIS METHODS

 Statistical methods

 Neural networks

 Fuzzy logic

Each of these methods generally works with one of two general method
classes:

 Divisive.

All items start in one cluster and are broken apart


 Agglomerative

All items start in individual clusters, and the clusters are


joined together.

K-MEANS CLUSTERING ALGORITHM

 The k-means clustering algorithm (where k stands for the


predetermined number of clusters) is arguably the most referenced
clustering algorithm.

 It has its roots in traditional statistical analysis. As the name


implies, the algorithm assigns each data point (customer, event,
object, etc.) to the cluster whose center (also called centroid) is the
nearest.

 The center is calculated as the average of all the points in the


cluster; that is, its coordinates are the arithmetic mean for each
dimension separately over all the points in the cluster.

 Initialization step: Choose the number of clusters (i.e., the value


of k).

 Step 1: Randomly generate k random points as initial cluster


centers.

 Step 2: Assign each point to the nearest cluster center.

 Step 3: Recompute the new cluster centers.

 Repetition step: Repeat steps 2 and 3 until some convergence


criterion is met (usually that the assignment of points to clusters
becomes stable).
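The initialization, assignment, and recomputation steps above can be sketched in pure Python (Euclidean distance; the convergence test checks whether the centers stop moving, which implies stable assignments). This is an illustrative sketch, not a production implementation.

```python
import math
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Sketch of the k-means steps: pick k initial centers, assign
    each point to its nearest center, recompute the centers as the
    arithmetic mean of each cluster, and repeat until stable."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cluster)) if cluster
            else centers[i]  # keep the old center if a cluster empties
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # convergence: centers stopped moving
            break
        centers = new_centers
    return centers, clusters
```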


Association Rule Mining

 Association rule mining is a popular data mining method

 Association rule mining aims to find interesting relationships


(affinities) between variables (items) in large databases.

 It is commonly called a market-basket analysis.

 The main idea in market basket analysis is to identify strong


relationships among different products (or services) that are
usually purchased together

 The outcome of the analysis is invaluable information that can be


used to better understand customer-purchase behavior in order to
maximize the profit from business transactions.

 A business can take advantage of such knowledge by

(1) putting the items next to each other to make it more convenient
for the customers to pick them

(2) promoting the items as a package (do not put one on sale if

the other(s) is on sale); and

(3) placing them apart from each other so that the customer has to
walk the aisles to search for it, and by doing so potentially seeing
and buying other items

 Applications of market-basket analysis include cross- marketing,


cross-selling, store design, catalog design,e-commerce site
design, optimization of online advertising, product pricing, and
sales/promotion configuration.

 "Are all association rules interesting and useful?"

 In order to answer such a question, association rule mining uses


two common metrics: support and confidence
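The two metrics can be sketched as follows, using a small hypothetical set of market baskets: support is the fraction of transactions containing an itemset, and confidence is the fraction of antecedent-containing transactions that also contain the consequent.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical market baskets
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread"}, {"milk", "butter"}]
```

For these baskets, the rule milk → bread has support 2/4 = 0.5 and confidence (2/4) / (3/4) = 2/3.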

 Several algorithms are available for generating association rules.


Some well-known algorithms include


 Apriori,
 Eclat, and
 FP-Growth.
 These algorithms only do half the job, which is to identify the
frequent itemsets in the database.

 A frequent itemset is an arbitrary number of items that frequently


go together in a transaction

 Once the frequent itemsets are identified, they need to be


converted into rules with antecedent and consequent parts.

 APRIORI ALGORITHM

 The Apriori algorithm is the most commonly used algorithm to


discover association rules.

 Given a set of itemsets (e.g., sets of retail transactions, each


listing individual items purchased)

 The algorithm attempts to find subsets that are common to at


least a minimum number of the itemsets (i.e., complies with a
minimum support).

 Apriori uses a bottom-up approach, where frequent subsets are


extended one item at a time (a method known as candidate
generation, whereby the size of frequent subsets increases
from one-item subsets to two-item subsets, then three-item
subsets, etc.),

 Groups of candidates at each level are tested against the data


for minimum support. The algorithm terminates when no further
successful extensions are found.
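A minimal sketch of the bottom-up candidate-generation loop described above. It covers only the frequent-itemset half of the job; converting the discovered itemsets into rules with antecedent and consequent parts is a separate step.

```python
def apriori_frequent_itemsets(transactions, min_support):
    """Frequent k-itemsets are extended one item at a time, and each
    candidate level is tested against the minimum support; the loop
    terminates when no extension survives."""
    n = len(transactions)

    def frequent(candidates):
        return {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    level = frequent(items)          # frequent 1-itemsets
    all_frequent = set(level)
    size = 1
    while level:
        size += 1
        candidates = {a | b for a in level for b in level
                      if len(a | b) == size}
        level = frequent(candidates)  # test against the data
        all_frequent |= level
    return all_frequent
```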


TEXT MINING

Text mining (text data mining or knowledge discovery in textual


databases)

 It is the semi automated process of extracting patterns from large


amounts of unstructured data sources.

 Text mining is the same as data mining;

 But with text mining, the input to the process is a collection of


unstructured (or less structured) data files such as Word
documents, PDF files, text excerpts, XML files, and so on.

 Text mining has two main steps

1. Imposing structure to the text-based data sources

2. Extracting relevant information and knowledge from this


structured text-based data using data mining techniques and
tools

TEXT MINING CONCEPTS AND DEFINITIONS

Benefits of text mining

 Law (court orders)

 Academic research (research articles)

 Finance (quarterly reports)


 Medicine (discharge summaries)

 Biology (molecular interactions)

 Technology (patent files), and

 Marketing (customer comments)

Example

 Free-form text-based interactions with customers in the form of


complaints

 Electronic communications and e-mail.

 Used to classify and filter junk e-mail

 Used to automatically prioritize e-mail based on importance


level as well as to generate automatic responses

Application areas of text mining:

 Information extraction - Identification of key phrases and


relationships

 Topic tracking - Based on a user profile and documents that a


user views, text mining can predict other documents

 Summarization - To save time on the part of the reader.

 Categorization - Identifying the main themes of a document and


then placing them into a predefined set of categories based on
those themes.

 Clustering - Grouping similar documents without having a


predefined set of categories.

 Concept linking - Connects related documents by identifying their


shared concepts

 Question answering - Finding the best answer to a given


question through knowledge-driven pattern matching.


NATURAL LANGUAGE PROCESSING

 NLP is an important component of text mining

 NLP is a subfield of artificial intelligence and computational


linguistics.

 NLP studies the problem of "understanding" the natural human


language

 NLP Converts the human language (such as textual documents)


into more formal representations (in the form of numeric and
symbolic data) that are easier for computer programs to
manipulate.

 The goal of NLP is to move beyond syntax-driven text


manipulation (which is often called "word counting") to a true
understanding and processing of natural language

 Natural human language is vague, and a true understanding of


meaning requires extensive knowledge of the topic

Challenges associated with the implementation of NLP

 Part-of-speech tagging - It is difficult to mark up terms in a text as


corresponding to a particular part of speech (such as nouns, verbs,
adjectives, and adverbs)

 Text segmentation - Some written languages, such as Chinese,


Japanese, and Thai, do not have single-word boundaries. In these
instances, the text-parsing task requires the identification of word
boundaries, which is often a difficult task.

 Word sense disambiguation - Many words have more than one


meaning. Selecting the meaning that makes the most sense can
only be accomplished by taking into account the context within
which the word is used.

 Syntactic ambiguity - The grammar for natural languages is


ambiguous; that is, multiple possible sentence structures often
need to be considered.


 Imperfect or irregular input - Foreign or regional accents and


vocal impediments in speech and typographical or grammatical
errors in texts make the processing of the language an even more
difficult task.

 Speech acts - A sentence can often be considered an action by


the speaker.

The sentence structure alone may not contain enough


information to define this action.

For example, "Can you pass the class?" requests a
simple yes/no answer, whereas

"Can you pass the salt?" is a request for a physical
action to be performed.

 WordNet is a laboriously hand-coded database of English words,


their definitions, sets of synonyms, and various semantic relations
between synonym sets.

 It is a major resource for NLP applications, but it has proven to be


very expensive to build and maintain manually

 An important area of CRM, where NLP is making a significant


impact, is sentiment analysis.

 Sentiment analysis is a technique used to detect favorable and


unfavorable opinions toward specific products and services using a
large numbers of textual data sources (customer feedback in the
form of Web postings).

 NLP has successfully been applied to a variety of tasks via


computer programs to automatically process natural human
language.

Following are among the most popular of these tasks:

1. Information retrieval.

The science of searching for relevant documents, finding


specific information within them, and generating metadata as to their
contents.

2. Information extraction.

A type of information retrieval whose goal is to automatically


extract structured information, such as categorized and contextually and
semantically well-defined data from a certain domain, using unstructured
machine readable documents.

3. Named-entity recognition.

Also known as entity identification and entity extraction, this


subtask of information extraction seeks to locate and classify atomic
elements in text into predefined categories

4. Question answering.

The task of automatically answering a question posed in


natural language;

To find the answer to a question, the computer program may


use either a prestructured database or a collection of natural language
documents (a text corpus such as the World Wide Web).

5. Automatic summarization.

The creation of a shortened version of a textual document by


a computer program that contains the most important points of the
original document.

6. Natural language generation.

Systems convert information from computer databases into


readable human language.

7. Natural language understanding.

Systems convert samples of human language into more


formal representations that are easier for computer programs to
manipulate.

8. Machine translation.

The automatic translation of one human language to another


9. Foreign language reading.

A computer program that assists a nonnative language


speaker to read a foreign language with correct pronunciation and
accents on different parts of the words.

10. Foreign language writing.

A computer program that assists a nonnative language user


in writing in a foreign language.

11. Speech recognition.

Converts spoken words to machine-readable input.

12. Text-to-speech.

Also called speech synthesis, a computer program


automatically converts normal language text into human speech.

13. Text proofing.

A computer program reads a proof copy of a text in order to


detect and correct any errors

14. Optical character recognition.

The automatic translation of images of handwritten,


typewritten, or printed text (usually captured by a scanner) into machine
editable textual documents

TEXT MINING APPLICATIONS

1. Marketing Applications

 Text mining can be used to increase cross-selling and up-selling


by analyzing the unstructured data generated by call centers

 blogs, user reviews of products at independent Web sites, and


discussion board postings are a gold mine of customer sentiments

 Text Mining used to predict customer perceptions and subsequent


purchasing behavior


2. Security Applications

 ECHELON surveillance system – It is assumed to be capable of


identifying the content of telephone calls, faxes, e-mails, and other
types of data, intercepting information sent via satellites, public
switched telephone networks.

 EUROPOL developed an integrated system capable of accessing,


storing, and analyzing vast amounts of structured and unstructured
data sources in order to track transnational organized crime.

 The U.S. Federal Bureau of Investigation (FBI) and the Central


Intelligence Agency (CIA) jointly developed a supercomputer data
and text mining system. The system is expected to create a gigantic
data warehouse along with a variety of data and text mining
modules to meet the knowledge-discovery needs of federal, state,
and local law enforcement agencies.

 Text mining is also used in the area of deception detection

3. Biomedical Applications

 Experimental techniques such as DNA microarray analysis, serial


analysis of gene expression (SAGE), and mass spectrometry
proteomics, among others, are generating large amounts of data
related to genes and proteins.

 Knowing the location of a protein within a cell can help to


determine its potential as a drug target

4. Academic Applications

 Text Mining provides semantic cues to machines to answer


specific queries.


TEXT MINING PROCESS

Context diagram for the text mining process

As the context diagram indicates:

 The input into the text-based knowledge discovery process is the


unstructured as well as structured data collected, stored, and
made available to the process.

 The output of the process is the context-specific knowledge that


can be used for decision making.

 The controls, also called the constraints (inward connection to the


top edge of the box), of the process include

software and hardware limitations,

privacy issues, and the


difficulties related to processing of the text

 The mechanisms of the process include proper techniques,


software tools, and domain expertise.

 The text mining process can be broken down into three


consecutive tasks,

Task 1 : Establish the Corpus

Task 2: Create the Term–Document Matrix

Task 3: Extract Knowledge

 each of which has specific inputs to generate certain outputs.

The three steps Text Mining Processes

Task 1: Establish the corpus

 Collect all relevant unstructured data

(e.g., textual documents, XML files, emails, Web pages,


short notes, voice recordings…)

 Digitize, standardize the collection

(e.g., all in ASCII text files)

 Place the collection in a common place

(e.g., in a flat file, or in a directory as separate files)
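The corpus-building steps above can be sketched in Python; the directory layout, the `.txt` extension, and the function name are assumptions for illustration only:

```python
from pathlib import Path

def establish_corpus(directory):
    """Read every .txt file in a directory into one standardized corpus dict.

    "Digitize and standardize" here means: decode to text, lowercase, and
    strip surrounding whitespace, so later steps see a uniform representation.
    """
    corpus = {}
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        corpus[path.name] = text.lower().strip()
    return corpus
```

The result maps each document name to its cleaned text, i.e., the whole collection sits in one common place ready for term extraction.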


Task 2: Create the Term–by–Document Matrix

 The digitized and organized documents (the corpus) are used to


create the term–document matrix (TDM).

 In the TDM, rows represent the documents and columns represent


the terms.

 The relationships between the terms and documents are


characterized by indices (i.e., a relational measure that can be as
simple as the number of occurrences of the term in respective
documents).

A sample term–document matrix (in the original figure the term headings are printed vertically; they read: investment risk, project management, software engineering, development, SAP, …). Rows are Documents 1–6, and each cell records the number of times the term occurs in the document, with blank cells meaning zero; most cells hold a single occurrence, while one document contains a term three times.

 Should all terms be included?

 Stop words (terms to exclude), include words (terms to keep)

 Synonyms, homonyms

 Stemming

 What is the best representation of the indices (the values in the cells)?

 Raw counts; binary frequencies; log frequencies;

 Inverse document frequency

 TDM is a sparse matrix. How can we reduce the


dimensionality of the TDM?
 Manual – a domain expert goes through it
 Eliminate terms with very few occurrences in very
few documents (?)
 Transform the matrix using singular value
decomposition (SVD)
 SVD is similar to principal component analysis
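The indexing choices above (raw counts and inverse document frequency) can be illustrated with a small, self-contained sketch; the three toy documents are hypothetical, chosen only to echo the sample terms:

```python
import math
from collections import Counter

# Hypothetical mini-corpus (already collected and standardized in Task 1).
documents = {
    "doc1": "investment risk project management",
    "doc2": "software engineering",
    "doc3": "project management project management project management sap",
}

# Term-document matrix: one Counter per document; cells are raw occurrence
# counts, and a missing term simply counts as zero.
vocabulary = sorted({t for text in documents.values() for t in text.split()})
tdm = {d: Counter(text.split()) for d, text in documents.items()}

def tf_idf(term, doc):
    """Weight a raw count by inverse document frequency, so terms that
    appear in many documents contribute less to the index value."""
    tf = tdm[doc][term]
    df = sum(1 for counts in tdm.values() if counts[term] > 0)
    return tf * math.log(len(documents) / df) if df else 0.0
```

Binary and log frequencies are one-line variations on the same matrix (e.g., `1 if tf > 0 else 0`, or `math.log(1 + tf)`).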

Task 3: Extract patterns/knowledge

 Classification (text categorization)


 Clustering (natural groupings of text)
 Improve search recall
 Improve search precision
 Scatter/gather
 Query-specific clustering
 Association
 Trend Analysis (…)
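One common building block for the clustering and association tasks above is the cosine similarity between two document term vectors; a minimal sketch (the vector representation as `term -> count` dicts is an assumption for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (dicts term -> count).

    Values near 1 mean the documents share much vocabulary; values near 0
    mean they share little, which is the basis for natural groupings.
    """
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```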

TEXT MINING TOOLS

 Following are some of the popular text mining tools, which we


classify as

 Commercial software tools

 Free software tools

1. Commercial Software Tools

 The following are some of the most popular software tools used for
text mining.

 Note that many companies offer demonstration versions of their


products on their Web sites.

1. ClearForest offers text analysis and visualization tools


(clearforest.com).


2. IBM Intelligent Miner Data Mining Suite, now fully integrated


into IBM’s InfoSphere Warehouse software, includes data and
text mining tools (ibm.com).

3. Megaputer Text Analyst offers semantic analysis of free-form


text, summarization, clustering, navigation, and natural language
retrieval with search dynamic refocusing (megaputer.com).

4. SAS Text Miner provides a rich suite of text processing and


analysis tools (sas.com).

5. SPSS Text Mining for Clementine extracts key concepts,


sentiments, and relationships from call-center notes, blogs, e-
mails, and other unstructured data and converts them to a
structured format for predictive modeling (spss.com).

6. The Statistica Text Mining engine provides easy-to-use text mining functionality with exceptional visualization capabilities (statsoft.com).

7. VantagePoint provides a variety of interactive graphical views and analysis tools with powerful capabilities to discover knowledge from text databases (vpvp.com).

8. The WordStat analysis module from Provalis Research analyzes


textual information such as responses to open-ended questions
and interviews (provalisresearch.com).

2. Free Software Tools

Free software tools, some of which are open source, are available from
a number of nonprofit organizations:

1. GATE is a leading open source toolkit for text mining. It has a


free open source framework (or SDK) and graphical development
environment (gate.ac.uk).

2. RapidMiner has a community edition of its software that includes


text mining modules (rapid-i.com).

3. LingPipe is a suite of Java libraries for the linguistic analysis of


human language (alias-i.com/lingpipe).


4. S-EM (Spy-EM) is a text classification system that learns from positive and unlabeled examples (cs.uic.edu/~liub/S-EM/S-EM-download.html).

5. Vivisimo/Clusty is a Web search and text-clustering engine


(clusty.com).

WEB MINING OVERVIEW

 Web mining (or Web data mining) is the process of discovering


intrinsic relationships (i.e., interesting and useful information) from
Web data, which are expressed in the form of textual, linkage, or
usage information.

 The Web is perhaps the world’s largest data and text repository.

 The amount of information on the Web is growing rapidly every


day.

 A lot of interesting information can be found online:

 whose homepage is linked to which other pages

 how many people have links to a specific Web page

 how a particular site is organized

The Web also poses great challenges for effective and efficient knowledge discovery:

1. The Web is too big for effective data mining

2. The Web is too complex.

Web pages lack a unified structure; they contain far more authoring-style and content variation than traditional text documents.

3. The Web is too dynamic.

Not only does the Web grow rapidly, but its content is constantly
being updated. Blogs, news stories, stock market results, weather
reports


4. The Web is not specific to a domain.

Web users have very different backgrounds, interests, and usage


purposes.

5. The Web has everything.

Only a small portion of the information on the Web is truly relevant


or useful to someone

Three main areas of Web mining:

 Web content mining

 Web structure mining

 Web usage mining.

1. WEB CONTENT MINING

 Web content mining refers to the extraction of useful information


from Web pages.

 The documents may be extracted in some machine-readable


format so that automated techniques can generate some
information about the Web pages.

 Web crawlers are used to read through the content of a Web site
automatically.
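The link-reading part of a crawler's job can be sketched with Python's standard library; the HTML snippet below is hypothetical, and a real crawler would also fetch pages over HTTP and respect robots.txt:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content; a crawler would queue the extracted links
# and visit them in turn to read through the rest of the site.
page = '<html><body><a href="/about">About</a> <a href="http://example.org">Ext</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
```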


 The information gathered may include document characteristics


similar to what is used in text mining, but it may include additional
concepts such as the document hierarchy

 Web content mining can also be used to enhance the results


produced by search engines

 In addition to text, Web pages also contain hyperlinks pointing one


page to another.

 Hyperlinks contain a significant amount of hidden human


annotation

 When a Web page developer includes a link pointing to another


Web page, this can be regarded as the developer’s endorsement
of the other page.

 Therefore, the vast amount of Web linkage information provides a


rich collection of information about the relevance, quality, and
structure of the Web’s contents.

 A search on the Web to obtain information on a specific topic


usually returns a few relevant, high-quality Web pages and a larger
number of unusable Web pages.

 Use of an index based on page authority will improve the search results and the ranking of relevant pages.

 The idea of authority stems from earlier information retrieval work


using citations among journal articles to evaluate the impact of
research papers

 There are significant differences between the citations in research


articles and hyperlinks on Web pages:

 not every hyperlink represents an endorsement

 one authority will rarely have its Web page point to rival
authorities in the same domain

 authoritative pages are seldom particularly descriptive


 The structure of Web hyperlinks has led to another important


category of Web pages called a hub.

 A hub is one or more Web pages that provide a collection of


links to authoritative pages.

 Hub pages provide link to a collection of prominent sites on a


specific topic of interest.

 A hub could be a list of recommended links on an individual’s


homepage, recommended reference sites on a course Web page

Hyperlink-induced topic search (HITS)

 HITS is a link analysis algorithm that rates Web pages using the
hyperlink information contained within them.

 The HITS algorithm collects a base document set for a specific


query. It then recursively calculates the hub and authority values
for each document.

 To gather the base document set, a root set that matches the
query is fetched from a search engine.
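The recursive hub/authority calculation that HITS performs on the base set can be sketched as follows (the tiny link graph is hypothetical):

```python
def hits(graph, iterations=50):
    """Compute hub and authority scores for a directed link graph
    given as {page: [pages it links to]}.

    authority(p) grows with the hub scores of pages linking to p;
    hub(p) grows with the authority scores of pages p links to.
    Scores are normalized each round so they converge.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# "hub" links to both content pages; "a" also endorses "b".
graph = {"hub": ["a", "b"], "a": ["b"], "b": []}
hub_scores, auth_scores = hits(graph)
```

As expected, the page with the most incoming endorsements gets the highest authority score, and the page providing the collection of links gets the highest hub score.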

2.WEB STRUCTURE MINING

 It is the process of extracting useful information from the links


embedded in Web documents.

 It is used to identify authoritative pages and hubs.

 Just as links going to a Web page may indicate a site’s popularity (or authority), links within the Web page (or the complete Web site) may indicate the depth of coverage of a specific topic.

 Analysis of links is very important in understanding the


interrelationships among large numbers of Web pages, leading to
a better understanding of a specific Web community


3.WEB USAGE MINING

 Web usage mining is the extraction of useful information from data


generated through Web page visits and transactions.

 Three types of data are generated through Web page visits:

1. Automatically generated data stored in server access logs,


referrer logs, agent logs, and client-side cookies

2. User profiles

3. Metadata, such as page attributes, content attributes, and


usage data

 Analysis of the information collected by Web servers can help us better understand user behavior. Analysis of this data is often called clickstream analysis.

 By using the data and text mining techniques, a company might be


able to determine interesting patterns.

 Click stream Analysis:

 Useful in determining where to place online advertisements.


 Click stream analysis might also be useful for knowing when
visitors access a site.
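A first step in clickstream analysis is parsing server access-log entries into structured records; a minimal sketch, assuming common-log-format lines (the sample entries are hypothetical):

```python
import re
from collections import Counter

# One named group per field we care about: client IP, timestamp, method,
# requested URL, and HTTP status code.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3})'
)

log_lines = [
    '10.0.0.1 - - [01/Jan/2020:10:00:00 +0000] "GET /products HTTP/1.1" 200',
    '10.0.0.1 - - [01/Jan/2020:10:00:05 +0000] "GET /cart HTTP/1.1" 200',
    '10.0.0.2 - - [01/Jan/2020:10:01:00 +0000] "GET /products HTTP/1.1" 200',
]

# Page popularity counts -- one basic input for deciding ad placement.
hits_per_page = Counter()
for line in log_lines:
    m = LOG_PATTERN.match(line)
    if m:
        hits_per_page[m.group("url")] += 1
```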

Process of extracting knowledge from clickstream data


Applications of Web mining:

1. Determine the lifetime value of clients.

2. Design cross-marketing strategies across products.

3. Evaluate promotional campaigns.

4. Target electronic ads and coupons at user groups based on


user access patterns.

5. Predict user behavior based on previously learned rules and


users’ profiles.

6. Present dynamic information to users based on their interests


and profiles

Web usage mining software

SPATIAL DATA MINING

Spatial data mining is the process of discovering interesting, useful, non-trivial patterns from large spatial datasets.


PROCESS MINING

 "The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today’s (information) systems.
 Process mining includes (automated) process discovery (i.e., extracting process models from an event log), conformance checking (i.e., monitoring deviations by comparing model and log), social network/organizational mining, automated construction of simulation models, model extension, model repair, case prediction, and history-based recommendations."

Events and event logs:

 It is assumed that an event refers to a process activity or a task,


which is a well-defined step in the process and is related to a
particular case, i.e. process instance.

 Another assumption is that these events are ordered.

 The case or process instance is a specific occurrence or execution


of a business process, while activity is an operation, part of a case,
that is being executed.

 An event log stores information about cases and activities, but also
information about event performers, event timestamps (moment
when the event is triggered) or data elements recorded with the
event

 Process mining activities such as extracting and filtering data from


information systems are not trivial.

 Data may be distributed over a variety of sources, event data may be incomplete, an event log may contain outliers, logs may contain events at different levels of granularity, etc.

 Process Mining Manifesto gives following guidelines referring to


the event data:

 events should be trustworthy,


 event logs should be complete,

 any recorded event should have well-defined semantics,

 the event data should be safe.

 Process mining types

Three process mining types:

discovery,

conformance and

enhancement.

1. Process discovery:

 A process discovery technique produces a process model from


an event log, without using any a-priori information about the
process and it is the most eminent process mining technique.
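A minimal form of automated process discovery is building a directly-follows graph from an event log; the sketch below (with a hypothetical event log of case IDs and activities, assumed already ordered by time) counts how often one activity directly follows another:

```python
from collections import defaultdict

# Event log: each event is (case_id, activity); events within a case are
# ordered, so consecutive pairs are directly-follows relations.
event_log = [
    ("case1", "register"), ("case1", "check"), ("case1", "approve"),
    ("case2", "register"), ("case2", "check"), ("case2", "reject"),
]

def discover_dfg(log):
    """Group events by case, then count how often activity a is
    directly followed by activity b across all cases."""
    traces = defaultdict(list)
    for case, activity in log:
        traces[case].append(activity)
    dfg = defaultdict(int)
    for trace in traces.values():
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dict(dfg)

dfg = discover_dfg(event_log)
```

Real discovery algorithms (alpha miner, heuristic miner, etc.) build richer models, but they all start from frequency information of this kind.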

2. Conformance

 Conformance compares an existing process model with an


event log of the same process

 It is used to check if reality, as recorded in the log, conforms to


the model and vice versa.

 Conformance checking can be used to:

 check the quality of documented processes (assess whether they describe reality accurately);

 to identify deviating cases and understand what they have


in common; for auditing purposes;

 to judge the quality of a discovered process model
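A much-simplified conformance check can replay each logged trace against the model's allowed steps; in this sketch the model is assumed to be just a set of permitted directly-follows pairs (all activity names are hypothetical):

```python
def deviations(trace, allowed_pairs):
    """List the directly-follows steps in a trace that the model does not
    allow; an empty list means the trace conforms to the model."""
    return [(a, b) for a, b in zip(trace, trace[1:])
            if (a, b) not in allowed_pairs]

# Hypothetical model: every request is registered, then checked, then
# either approved or rejected.
model = {("register", "check"), ("check", "approve"), ("check", "reject")}

ok_trace = ["register", "check", "approve"]
bad_trace = ["register", "approve"]  # deviating case: skips the check step
```

Full conformance checking (e.g., token replay or alignments on Petri nets) is considerably more involved, but the idea of comparing recorded behavior against modeled behavior is the same.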

3. Enhancement :

 Enhancement extends or improves an existing process model


using information about the actual process recorded in event
log, with the aim of changing or extending the a-priori model.


 For instance, by using time stamps in the event log one can
extend the model to show bottlenecks, service levels,
throughput times and frequencies

Process mining software tools and techniques:

 Many contemporary process mining software tools were developed and are continuously improved, such as: Celonis (Celonis GmbH), Disco (Fluxicon), EDS (StereoLOGIC Ltd), Fujitsu (Fujitsu Ltd), Icaro (Icaro Tech), Icris (Icris), LANA (Lana Labs), Minit (Gradient ECM), myInvenio (Cognitive Technology), ProcessGold (ProcessGold International B.V.), ProM (open source, hosted at TU/e), ProM Lite (open source, hosted at TU/e), QPR (QPR), RapidProM (open source, hosted at TU/e), Rialto (Exeura), SNP (SNP Schneider-Neureither & Partner AG), and ARIS PPM (Software AG).

 Currently, the most prominent open-source tool is ProM (Process Mining Framework), as it offers a variety of plug-ins that enable application of various algorithms and the latest developments in process mining research.

 Three main categories of process mining algorithms:

 Deterministic algorithms,

 Heuristic algorithms and

 Genetic algorithms.

 Deterministic algorithms always generate repeatable models, as all of the data has to be known and the process mining output is constant for a given input of variables.

1.2. BUSINESS INTELLIGENCE PROCESS

What is Business Intelligence?

BI (Business Intelligence) is a set of processes, architectures, and technologies that convert raw data into meaningful information that drives profitable business actions.

 It is a suite of software and services to transform data into


actionable intelligence and knowledge.

 BI has a direct impact on organization's strategic, tactical and


operational business decisions.

 BI supports fact-based decision making using historical data


rather than assumptions and gut feeling.

 BI tools perform data analysis and create reports, summaries,


dashboards, maps, graphs, and charts to provide users with
detailed intelligence about the nature of the business.

Why is BI important?

 Measurement: creating KPI (Key Performance Indicators) based


on historic data

 Identify and set benchmarks for varied processes.

 With BI systems organizations can identify market trends and


spot business problems that need to be addressed.

 BI helps with data visualization, which enhances data quality and thereby the quality of decision making.

 BI systems can be used not just by enterprises but also by SMEs (Small and Medium Enterprises).

How Business Intelligence systems are implemented?

Here are the steps:

Step 1:

Raw data from corporate databases is extracted. The data could be spread across multiple, heterogeneous systems.

Step 2:

The data is cleaned and transformed into the data warehouse. The tables can be linked, and data cubes are formed.

Step 3:

Using the BI system, the user can ask queries, request ad hoc reports, or conduct any other analysis.
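Step 2 above (cleaning and transforming data into the warehouse) can be sketched in miniature; the source records, field names, and cleaning rules below are hypothetical:

```python
# Toy extract-transform-load: raw rows from two heterogeneous source
# systems are standardized and loaded into one warehouse-style fact list.
raw_sales = [
    {"product": " Laptop ", "amount": "1200"},  # system A: padded strings
    {"product": "laptop", "amount": 800},       # system B: mixed casing/types
]

def transform(row):
    """Clean one raw record: normalize the product name, coerce the amount."""
    return {"product": row["product"].strip().lower(),
            "amount": float(row["amount"])}

warehouse = [transform(r) for r in raw_sales]

# A query a BI user might then run against the warehouse (Step 3):
total_by_product = {}
for fact in warehouse:
    total_by_product[fact["product"]] = (
        total_by_product.get(fact["product"], 0.0) + fact["amount"])
```

Without the transform step, "` Laptop `" and "`laptop`" would be counted as two different products, which is exactly the kind of inconsistency the warehouse load resolves.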


Examples of Business Intelligence System used in Practice

 In an Online Transaction Processing (OLTP) system, information that could be fed into the product database could be:
 add a product line
 change a product price
 Correspondingly, in a Business Intelligence system, a query that would be executed for the product subject area could be: did the addition of the new product line or the change in product price increase revenues?
 In an advertising database of an OLTP system, queries that could be executed:
 change advertisement options
 increase the radio budget

Four types of BI users

The following are the four key players who use a Business Intelligence system:

1. The Professional Data Analyst:


The data analyst is a statistician who always needs to drill deep down into data. A BI system helps them get fresh insights to develop unique business strategies.

2. The IT users:

The IT user also plays a dominant role in maintaining the BI


infrastructure.

3. The head of the company:

The CEO or CXO can increase business profit by improving operational efficiency in the business.

4. The Business Users:

 Business intelligence users can be found across the organization. There are mainly two types of business users:
 the casual business intelligence user, and
 the power user.

Advantages of Business Intelligence

Here are some of the advantages of using Business Intelligence System:

1. Boost productivity

With a BI program, it is possible for businesses to create reports with a single click, which saves lots of time and resources. It also allows employees to be more productive on their tasks.

2. To improve visibility

BI also helps to improve the visibility of business processes and makes it possible to identify any areas that need attention.

3. Fix Accountability :

A BI system assigns accountability in the organization, as there must be someone who owns accountability and ownership of the organization's performance against its set goals.


4. It gives a bird's eye view:

BI system also helps organizations as decision makers get an


overall bird's eye view through typical BI features like dashboards and
scorecards.

5. It streamlines business processes:

BI takes out all complexity associated with business processes. It


also automates analytics by offering predictive analysis, computer
modeling, benchmarking and other methodologies.

6. It allows for easy analytics:

BI software has democratized its usage, allowing even nontechnical or non-analyst users to collect and process data quickly. This puts the power of analytics into the hands of many people.

BI System Disadvantages

1. Cost:

Business intelligence can prove costly for small as well as medium-sized enterprises. The use of such a system may be expensive for routine business transactions.

2. Complexity:

Another drawback of BI is the complexity of implementing the data warehouse. It can be so complex that it can make business techniques rigid to deal with.

3. Limited use

Like all improved technologies, BI was first established keeping in consideration the buying competence of rich firms. Therefore, BI systems are still not affordable for many small and medium-sized companies.

4. Time Consuming Implementation

It takes almost one and a half years for a data warehousing system to be completely implemented; therefore, it is a time-consuming process.


Trends in Business Intelligence

Artificial Intelligence:

Gartner's report indicates that AI and machine learning now take on complex tasks once done by human intelligence. This capability is being leveraged to come up with real-time data analysis and dashboard reporting.

Collaborative BI:

BI software combined with collaboration tools, including social media and other latest technologies, enhances working and sharing by teams for collaborative decision making.

Embedded BI:

Embedded BI allows the integration of BI software, or some of its features, into another business application to enhance and extend its reporting functionality.

Cloud Analytics:

BI applications will soon be offered in the cloud, and more businesses will be shifting to this technology. As per predictions, within a couple of years spending on cloud-based analytics will grow 4.5 times faster.

Prepared by,

D.DURAI KUMAR,

Head Of the Department,

Department Of Information Technology,

GTEC.
