
BUSINESS INTELLIGENCE

Dr. Nasim AbdulWahab Matar


Dr. Firas Yousef Omar
CHAPTER 4

Data Mining

DATA MINING
Data mining is the art and science of discovering knowledge, insights, and
patterns in data. It is the act of extracting useful patterns from an
organized collection of data. Patterns must be valid, novel, potentially
useful, and understandable. The implicit assumption is that data about the
past can reveal patterns of activity that can be projected into the future.

DATA MINING DISCIPLINES

Data mining is a multidisciplinary field that borrows techniques from a variety of fields.
1. It utilizes the knowledge of data quality and data
organizing from the databases area.
2. It draws modeling and analytical techniques from
statistics and computer science (artificial
intelligence) areas.
3. It also draws the knowledge of decision-making
from the field of business management.

DATA MINING EXAMPLE
For example, “customers who buy cheese and milk also buy bread 90 percent of the time” would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, “people with blood pressure greater than 160 and an age greater than 65 are at high risk of dying from a heart stroke” is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity. Past data can be of predictive value in many complex situations, especially where the pattern may not be easily visible without a modeling technique.
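To make the arithmetic behind such a rule concrete, here is a minimal Python sketch that computes the confidence of the rule {cheese, milk} → {bread} from a handful of transactions; the baskets are invented for illustration:

```python
# Minimal sketch: computing the confidence of the rule
# {cheese, milk} -> {bread}. The transactions below are invented.
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"cheese", "bread", "eggs"},
]

antecedent = {"cheese", "milk"}
consequent = {"bread"}

# Count baskets containing the antecedent, and those containing both.
antecedent_count = sum(antecedent <= t for t in transactions)
both_count = sum((antecedent | consequent) <= t for t in transactions)

confidence = both_count / antecedent_count
print(f"Confidence: {confidence:.0%}")  # 2 of 3 baskets -> 67%
```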

GATHERING AND SELECTING DATA
The total amount of data in the world is doubling every 18 months. There is an ever-
growing avalanche of data coming with higher velocity, volume, and variety. One has to
quickly use it or lose it. Smart data mining requires choosing where to play. One has to make
judicious decisions about what to gather and what to ignore, based on the purpose of the
data mining exercises. It is like deciding where to fish; not all streams of data will be equally
rich in potential insights.

DIFFERENCE BETWEEN STRUCTURED, SEMI-STRUCTURED AND
UNSTRUCTURED DATA

Big data is characterized by huge volume, high velocity, and an ever-extending variety of data. It comes in three types:
structured data, semi-structured data, and unstructured data.
Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today, it is the most processed kind of data and the simplest to manage. Example: relational data.

Semi-structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database (though this can be very hard for some kinds of semi-structured data). Example: XML data.
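As an illustration of how semi-structured data can be processed into relational form, here is a small sketch using Python's standard library to flatten an XML fragment into table-like rows; the XML document and its fields are invented for the example:

```python
# Sketch: flattening semi-structured XML into structured rows.
# The XML document and its field names are invented for illustration.
import xml.etree.ElementTree as ET

xml_doc = """
<customers>
  <customer id="1"><name>Alice</name><city>Amman</city></customer>
  <customer id="2"><name>Bob</name></customer>   <!-- city missing -->
</customers>
"""

rows = []
for cust in ET.fromstring(xml_doc).findall("customer"):
    rows.append({
        "id": cust.get("id"),
        "name": cust.findtext("name"),
        "city": cust.findtext("city"),  # None when the tag is absent
    })

print(rows)
# [{'id': '1', 'name': 'Alice', 'city': 'Amman'},
#  {'id': '2', 'name': 'Bob', 'city': None}]
```

The missing city tag shows why this data is only "semi" structured: the schema must tolerate absent fields before the rows can be loaded into a fixed relational table.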


Unstructured data –
Unstructured data is data that is not organized in a pre-defined manner or does not have a pre-defined data model, so it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word, PDF, text, media logs.

DATA CLEANSING AND PREPARATION
The quality of data is critical to the success and value of the data mining project. Otherwise, the situation becomes one of garbage in, garbage out (GIGO). The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the business's control and is less likely to be reliable.

1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped (see the pandas sketch after this list).
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average, modal, or default values.
3. Data elements may need to be transformed from one unit to another. For example, total costs of
health care and the total number of patients may need to be reduced to cost/patient to allow
comparability of that value.
4. Continuous values may need to be binned into a few buckets to help with some analyses. For
example, work experience could be binned as low, medium, and high.
5. Data elements may need to be adjusted to make them comparable over time. For example,
currency values may need to be adjusted for inflation; they would need to be converted to the
same base year for comparability. They may need to be converted to a common currency.
6. Outlier data elements need to be removed after careful review, to avoid the skewing of
results. For example, one big donor could skew the analysis of alumni donors in an
educational setting.
7. Any biases in the selection of data should be corrected to ensure the data is
representative of the phenomena under analysis. If the data includes many more
members of one gender than is typical of the population of interest, then adjustments
need to be applied to the data.
8. Data should be brought to the same granularity to ensure comparability. Sales data
may be available daily, but the sales person compensation data may only be available
monthly. To relate these variables, the data must be brought to the lowest common
denominator, in this case, monthly.
9. Data may need to be selected to increase information density. Some data may not
show much variability, because it was not properly recorded or for any other reasons.
This data may dull the effects of other differences in the data and should be removed to
improve the information density of the data.
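As referenced in step 1, the sketch below illustrates a few of these steps (de-duplication, filling missing values with the average, and binning) using the pandas library; the column names and values are invented for illustration:

```python
# Sketch of a few cleansing steps from the list above, using pandas.
# Column names and values are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age":         [34, 34, None, 51, 29],
    "experience":  [2, 2, 7, 15, 4],   # years
})

# Step 1: remove duplicate rows (same record received from two sources).
df = df.drop_duplicates()

# Step 2: fill missing values with the column average.
df["age"] = df["age"].fillna(df["age"].mean())

# Step 4: bin a continuous value into low / medium / high buckets.
df["experience_level"] = pd.cut(
    df["experience"], bins=[0, 5, 10, 100], labels=["low", "medium", "high"]
)

print(df)
```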

HOW TO DO DATA MINING
The accepted data mining process involves six steps:
1. Business understanding
The first step is establishing what the goals of the project are and how data mining can help you reach them. A plan should be
developed at this stage to include timelines, actions, and role assignments.
2. Data understanding
Data is collected from all applicable data sources in this step. Data visualization tools are often used in this stage to explore the
properties of the data to ensure it will help achieve the business goals.
3. Data preparation
Data is then cleansed, and missing data is included to ensure it is ready to be mined. Data processing can take enormous
amounts of time depending on the amount of data analyzed and the number of data sources. Therefore, distributed systems
are used in modern database management systems (DBMS) to improve the speed of the data mining process rather than
burden a single system. They’re also more secure than having all an organization’s data in a single data warehouse. It’s
important to include failsafe measures in the data manipulation stage so data is not permanently lost.
4. Data Modeling
Mathematical models are then used to find patterns in the data using sophisticated data tools.
5. Evaluation
The findings are evaluated and compared to business objectives to determine if they should be deployed across the
organization.
6. Deployment
In the final stage, the data mining findings are shared across everyday business operations. An enterprise business intelligence
platform can be used to provide a single source of truth for self-service data discovery.
BENEFITS OF DATA MINING

1. Automated Decision-Making
Data Mining allows organizations to continually analyze data and automate both routine and critical
decisions without the delay of human judgment. Banks can instantly detect fraudulent transactions,
request verification, and even secure personal information to protect customers against identity theft.
Deployed within a firm’s operational algorithms, these models can collect, analyze, and act on data
independently to streamline decision making and enhance the daily processes of an organization.

2. Accurate Prediction and Forecasting


Planning is a critical process within every organization. Data mining facilitates planning and provides
managers with reliable forecasts based on past trends and current conditions. Macy’s implements
demand forecasting models to predict the demand for each clothing category at each store and route
the appropriate inventory to efficiently meet the market’s needs.

3. Cost Reduction
Data mining allows for more efficient use and allocation of resources. Organizations can plan and
make automated decisions with accurate forecasts that will result in maximum cost
reduction. Delta embedded RFID chips in passengers' checked baggage and deployed data mining
models to identify holes in their process and reduce the number of bags mishandled. This process
improvement increases passenger satisfaction and decreases the cost of searching for and re-routing
lost baggage.

4. Customer Insights
Firms deploy data mining models from customer data to uncover key characteristics and differences
among their customers. Data mining can be used to create personas and personalize each touchpoint
to improve overall customer experience. In 2017, Disney invested over one billion dollars to create
and implement “Magic Bands.” These bands have a symbiotic relationship with consumers, working to
increase their overall experience at the resort while simultaneously collecting data on their activities
for Disney to analyze to further enhance their customer experience.
CHALLENGES OF DATA MINING - BIG DATA
Big Data
The challenges of big data are prolific and penetrate every field that collects, stores, and analyzes data. Big data
is characterized by four major challenges: volume, variety, veracity, and velocity. The goal of data mining is to
mitigate these challenges and unlock the data’s value.
• Volume describes the challenge of storing and processing the enormous quantity of data collected by
organizations. This enormous amount of data presents two major challenges: first, it is more difficult to find
the correct data, and second, it slows down the processing speed of data mining tools.
• Variety encompasses the many different types of data collected and stored. Data mining tools must be
equipped to simultaneously process a wide array of data formats. Failing to focus an analysis on both
structured and unstructured data inhibits the value added by data mining.
• Velocity details the increasing speed at which new data is created, collected, and stored. While volume refers
to increasing storage requirements and variety refers to the increasing types of data, velocity is the challenge
associated with the rapidly increasing rate of data generation.
• Finally, veracity acknowledges that not all data is equally accurate. Data can be messy, incomplete,
improperly collected, and even biased. As with anything, the quicker data is collected, the more errors will
manifest within the data. The challenge of veracity is to balance the quantity of data with its quality.
CHALLENGES OF DATA MINING - OVER-FITTING MODELS

Over-Fitting Models
Over-fitting occurs when a model explains the natural errors within the sample instead of the underlying trends of the population. Over-fitted models are often overly complex and utilize an excess of independent variables to generate a prediction. Therefore, the risk of over-fitting is heightened by the increase in volume and variety of data. Too few variables make the model irrelevant, whereas too many variables restrict the model to the known sample data. The challenge is to moderate the number of variables used in data mining models and balance their predictive power with accuracy.
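One way to see over-fitting in action is to compare training error with test error as model complexity grows. The sketch below does this on synthetic data with polynomial models of increasing degree; it is an illustration under invented data, not a prescribed method:

```python
# Sketch: over-fitting shown by comparing train vs. test error as the
# polynomial degree (model complexity) grows. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)  # noisy signal

idx = rng.permutation(x.size)          # random train/test split
train, test = idx[:20], idx[20:]

for degree in (1, 3, 12):
    coeffs = np.polyfit(x[train], y[train], degree)  # fit on training data only
    pred = np.polyval(coeffs, x)
    train_mse = np.mean((pred[train] - y[train]) ** 2)
    test_mse = np.mean((pred[test] - y[test]) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# A very high degree typically drives training error down while test error
# rises: the model has memorized the sample's noise, not the population trend.
```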

CHALLENGES OF DATA MINING - COST OF SCALE
Cost of Scale
As data velocity continues to increase data’s volume and variety, firms must scale these models and apply
them across the entire organization. Unlocking the full benefits of data mining with these models requires
significant investment in computing infrastructure and processing power. To reach scale, organizations must
purchase and maintain powerful computers, servers, and software designed to handle the firm’s large
quantity and variety of data.

19
CHALLENGES OF DATA MINING - PRIVACY AND SECURITY
Privacy and Security

The increased storage requirement of data has forced many firms to turn toward cloud computing and
storage. While the cloud has empowered many modern advances in data mining, the nature of the service
creates significant privacy and security threats. Organizations must protect their data from malicious
figures to maintain the trust of their partners and customers.

With data privacy comes the need for organizations to develop internal rules and constraints on the use
and implementation of a customer’s data. Data mining is a powerful tool that provides businesses with
compelling insights into their consumers. However, at what point do these insights infringe on an
individual’s privacy? Organizations must weigh this relationship with their customers, develop policies to
benefit consumers, and communicate these policies to the consumers to maintain a trustworthy
relationship.

TYPES OF DATA MINING
Data mining has two primary processes: supervised and unsupervised learning.

Supervised Learning
The goal of supervised learning is prediction or classification. The easiest way to conceptualize this process is
to look for a single output variable. A process is considered supervised learning if the goal of the model is to
predict the value of an observation. One example is spam filters, which use supervised learning to classify
incoming emails as unwanted content and automatically remove these messages from your inbox.
Common analytical models used in supervised data mining approaches are:

1. Linear Regressions
Linear regressions predict the value of a continuous variable using one or more independent inputs. Realtors
use linear regressions to predict the value of a house based on square footage, bed-to-bath ratio, year built,
and zip code.
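A minimal sketch of such a regression, using scikit-learn and an invented price/square-footage data set:

```python
# Sketch: predicting house price from square footage with a linear
# regression (scikit-learn). The tiny data set is invented.
from sklearn.linear_model import LinearRegression

square_feet = [[1100], [1500], [1800], [2400], [3000]]
price = [199_000, 245_000, 299_000, 369_000, 449_000]

model = LinearRegression().fit(square_feet, price)
print(model.predict([[2000]]))  # estimated price for a 2,000 sq ft house
```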

TYPES OF DATA MINING - SUPERVISED LEARNING
2. Logistic Regressions
Logistic regressions predict the probability of a categorical variable using one or more independent inputs.
Banks use logistic regressions to predict the probability that a loan applicant will default based on credit
score, household income, age, and other personal factors.
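A comparable sketch for a logistic regression, again with scikit-learn and an invented credit-score data set:

```python
# Sketch: predicting default probability from credit score with a
# logistic regression (scikit-learn). Data is invented.
from sklearn.linear_model import LogisticRegression

credit_score = [[520], [580], [620], [680], [710], [760], [800]]
defaulted    = [1,     1,     1,     0,     0,     0,     0]

model = LogisticRegression().fit(credit_score, defaulted)
print(model.predict_proba([[650]])[0, 1])  # probability of default
```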

3. Time Series
Time series models are forecasting tools which use time as the primary independent variable. Retailers,
such as Macy’s, deploy time series models to predict the demand for products as a function of time and use
the forecast to accurately plan and stock stores with the required level of inventory.
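A very simple illustration of time as the independent variable is a linear trend fit over monthly demand; the figures below are invented, and real retail forecasting models are far more elaborate:

```python
# Sketch: a naive time series forecast using a linear trend over time.
# The monthly demand figures are invented for illustration.
import numpy as np

demand = np.array([120, 132, 128, 141, 150, 149, 163, 170])  # 8 months
t = np.arange(len(demand))

slope, intercept = np.polyfit(t, demand, 1)   # fit demand = slope*t + intercept
next_month = slope * len(demand) + intercept  # forecast for month 9
print(f"forecast: {next_month:.0f} units")
```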

4. Classification or Regression Trees


Classification Trees are a predictive modeling technique that can be used to predict the value of both
categorical and continuous target variables. Based on the data, the model will create sets of binary rules to
split and group the highest proportion of similar target variables together. Following those rules, the group
that a new observation falls into will become its predicted value.
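A small sketch of a classification tree with scikit-learn, reusing the earlier blood-pressure/age example; the patient records are invented:

```python
# Sketch: a classification tree (scikit-learn) that learns binary split
# rules from labeled examples. Features and labels are invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [blood_pressure, age]; label: 1 = high risk, 0 = low risk
X = [[170, 70], [165, 68], [120, 40], [130, 55], [180, 72], [110, 30]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["blood_pressure", "age"]))
print(tree.predict([[160, 66]]))  # predicted class for a new patient
```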

5. Neural Networks
A neural network is an analytical model inspired by the structure of the brain, its neurons, and their connections. These models were originally created in the 1940s but have only recently gained popularity with statisticians and data scientists. Neural networks use inputs and, based on their magnitude, will “fire” or “not fire” a node based on its threshold requirement. This signal, or lack thereof, is then combined with the other “fired” signals in the hidden layers of the network, where the process repeats itself until an output is created. Since one of the benefits of neural networks is a near-instant output, self-driving cars deploy these models to accurately and efficiently process data and autonomously make critical decisions.
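The "fire or not fire" behavior of a single node can be sketched in a few lines; the weights, inputs, and threshold below are invented for illustration:

```python
# Sketch: a single artificial neuron "firing" when the weighted sum of
# its inputs reaches a threshold -- the basic unit that is stacked into
# the hidden layers described above. Values are invented.
import numpy as np

def neuron(inputs, weights, threshold):
    """Fire (1) if the weighted input sum reaches the threshold, else 0."""
    return int(np.dot(inputs, weights) >= threshold)

inputs  = np.array([0.9, 0.2, 0.7])
weights = np.array([0.5, 0.8, 0.3])
print(neuron(inputs, weights, threshold=0.8))  # 0.82 >= 0.8 -> the node fires
```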

6. K-Nearest Neighbor
The k-nearest neighbor method is used to categorize a new observation based on past observations. Unlike the previous methods, k-nearest neighbor is data-driven, not model-driven. This method makes no underlying assumptions about the data, nor does it employ complex processes to interpret its inputs. The basic idea of the k-nearest neighbor model is that it classifies a new observation by identifying its k closest neighbors and assigning it the majority's value. Many recommender systems use this method to identify and classify similar content, which is later pulled by the greater algorithm.
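A minimal k-nearest neighbor sketch with scikit-learn (k = 3), using invented two-dimensional points:

```python
# Sketch: classifying a new observation by majority vote of its k
# nearest neighbors (scikit-learn). Points and labels are invented.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[7, 7]]))  # -> ['B'], the majority among its 3 neighbors
```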

TYPES OF DATA MINING - UNSUPERVISED LEARNING
Unsupervised Learning
Unsupervised tasks focus on understanding and describing data to reveal underlying patterns within it.
Recommendation systems employ unsupervised learning to track user patterns and provide them with
personalized recommendations to enhance their customer experience.
Common analytical models used in unsupervised data mining approaches are:

1. Clustering
Clustering models group similar data together. They are best employed with complex data sets describing a
single entity. One example is look-alike modeling, which groups similarities between segments, identifies clusters, and targets new groups that look like an existing group.
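A short clustering sketch with scikit-learn; the customer features are invented, and k-means is just one of several clustering algorithms:

```python
# Sketch: grouping similar customers with k-means clustering
# (scikit-learn). The two-feature customer data is invented.
from sklearn.cluster import KMeans

# features: [annual_spend, visits_per_month]
customers = [[200, 1], [220, 2], [250, 1], [900, 8], [950, 9], [870, 7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two look-alike segments
```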

2. Association Analysis
Association analysis is also known as market basket analysis and is used to identify items that frequently
occur together. Supermarkets commonly use this tool to identify paired products and spread them out in the
store to encourage customers to pass by more merchandise and increase their purchases.
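A bare-bones version of this idea is simply counting how often item pairs appear in the same basket; production systems use algorithms such as Apriori, but the invented-data sketch below shows the core counting step:

```python
# Sketch: a bare-bones market basket analysis that counts how often
# item pairs occur together. The baskets are invented.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk", "cheese"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "cheese"},
    {"bread", "milk", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts.most_common(2))
# [(('bread', 'milk'), 3), ...] -> 'bread' and 'milk' pair most often
```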

3. Principal Component Analysis
Principal component analysis is used to illustrate hidden correlations between input variables and to create new variables, called principal components, which capture the same information contained in the original data but with fewer variables. By reducing the number of variables used to convey the same level of information, analysts can increase the utility and accuracy of supervised data mining models.
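A minimal PCA sketch with scikit-learn, reducing three correlated (invented) input variables to a single principal component:

```python
# Sketch: reducing correlated inputs to fewer principal components
# (scikit-learn). The small data matrix is invented.
import numpy as np
from sklearn.decomposition import PCA

# three correlated input variables, six observations
X = np.array([[2.0, 4.1, 1.0], [3.0, 6.2, 1.5], [4.0, 7.9, 2.1],
              [5.0, 10.1, 2.4], [6.0, 12.2, 3.0], [7.0, 13.8, 3.6]])

pca = PCA(n_components=1)
components = pca.fit_transform(X)     # one variable instead of three
print(pca.explained_variance_ratio_)  # share of information retained
```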

TOOLS AND PLATFORMS FOR DATA MINING
Data mining tools have existed for many decades. However, they have recently become more important as the value of data has grown and the field of big data analytics has come into prominence. There is a wide range of data mining platforms available in the market today.

• MS Excel is a relatively simple and easy data mining tool. It can become quite versatile once the Analysis ToolPak and other add-in products are installed on it.
• IBM’s SPSS Modeler is an industry-leading data mining platform. It offers a powerful set
of tools and algorithms for most popular data mining capabilities. It has colorful GUI
format with drag-and-drop capabilities. It can accept data in multiple formats, including
reading Excel files directly.
• Weka is an open-source GUI-based tool that offers a large number of data mining
algorithms.
• ERP systems include some data analytic capabilities, too. SAP has its Business Objects BI software. Business Objects is considered one of the leading BI suites in the industry and is often used by organizations that use SAP.
HOMEWORK

THANK YOU
Dr. Firas Yousef Omar

[email protected]

EXT: 9400
