
BUSINESS INTELLIGENCE

Dr. Nasim AbdulWahab Matar


Dr. Firas Yousef Omar
CHAPTER 4

Data Mining

DATA MINING
Data mining is the art and science of discovering knowledge, insights, and
patterns in data. It is the act of extracting useful patterns from an
organized collection of data. Patterns must be valid, novel, potentially
useful, and understandable. The implicit assumption is that data about the
past can reveal patterns of activity that can be projected into the future.

DATA MINING DISCIPLINES

Data mining is a multidisciplinary field that borrows techniques from a variety of fields.
1. It utilizes the knowledge of data quality and data
organizing from the databases area.
2. It draws modeling and analytical techniques from
statistics and computer science (artificial
intelligence) areas.
3. It also draws the knowledge of decision-making
from the field of business management.

DATA MINING EXAMPLE
For example, “customers who buy cheese and milk also buy bread 90 percent of the time” would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, “people with blood pressure greater than 160 and an age greater than 65 are at high risk of dying from a heart stroke” is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity. Past data can be of predictive value in many complex situations, especially where the pattern may not be easily visible without a modeling technique.
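To make the arithmetic behind such a rule concrete, here is a minimal Python sketch that computes the confidence of the rule {cheese, milk} → {bread} from a handful of transactions; the baskets are invented for illustration:

```python
# Minimal sketch: computing the confidence of the rule
# {cheese, milk} -> {bread}. The transactions below are invented.
transactions = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"cheese", "bread", "eggs"},
]

antecedent = {"cheese", "milk"}
consequent = {"bread"}

# Count baskets containing the antecedent, and those containing both.
antecedent_count = sum(antecedent <= t for t in transactions)
both_count = sum((antecedent | consequent) <= t for t in transactions)

confidence = both_count / antecedent_count
print(f"Confidence: {confidence:.0%}")  # 2 of 3 baskets -> 67%
```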

GATHERING AND SELECTING DATA
The total amount of data in the world is doubling every 18 months. There is an ever-
growing avalanche of data coming with higher velocity, volume, and variety. One has to
quickly use it or lose it. Smart data mining requires choosing where to play. One has to make
judicious decisions about what to gather and what to ignore, based on the purpose of the
data mining exercises. It is like deciding where to fish; not all streams of data will be equally
rich in potential insights.

DIFFERENCE BETWEEN STRUCTURED, SEMI-STRUCTURED AND
UNSTRUCTURED DATA

Big data is characterized by huge volume, high velocity, and an ever-extending variety of data. It comes in three types:
structured data, semi-structured data, and unstructured data.
Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today, it is the most processed kind of data and the simplest to manage. Example: relational data.

Semi-structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database (though this can be very hard for some kinds of semi-structured data). Example: XML data.
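As an illustration of how semi-structured data can be processed into relational form, here is a small sketch using Python's standard library to flatten an XML fragment into table-like rows; the XML document and its fields are invented for the example:

```python
# Sketch: flattening semi-structured XML into structured rows.
# The XML document and its field names are invented for illustration.
import xml.etree.ElementTree as ET

xml_doc = """
<customers>
  <customer id="1"><name>Alice</name><city>Amman</city></customer>
  <customer id="2"><name>Bob</name></customer>   <!-- city missing -->
</customers>
"""

rows = []
for cust in ET.fromstring(xml_doc).findall("customer"):
    rows.append({
        "id": cust.get("id"),
        "name": cust.findtext("name"),
        "city": cust.findtext("city"),  # None when the tag is absent
    })

print(rows)
# [{'id': '1', 'name': 'Alice', 'city': 'Amman'},
#  {'id': '2', 'name': 'Bob', 'city': None}]
```

The missing city tag shows why this data is only "semi" structured: the schema must tolerate absent fields before the rows can be loaded into a fixed relational table.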


Unstructured data –
Unstructured data is data that is not organized in a pre-defined manner or does not have a pre-defined data model, so it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word, PDF, text, media logs.

DATA CLEANSING AND PREPARATION
The quality of data is critical to the success and value of the data mining project. Otherwise, the situation becomes one of garbage in, garbage out (GIGO). The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the business's control and is less likely to be reliable.

1. Duplicate data needs to be removed. The same data may be received from multiple sources. When merging the data sets, data must be de-duped (see the pandas sketch after this list).
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average, modal, or default values.
3. Data elements may need to be transformed from one unit to another. For example, total costs of
health care and the total number of patients may need to be reduced to cost/patient to allow
comparability of that value.
4. Continuous values may need to be binned into a few buckets to help with some analyses. For
example, work experience could be binned as low, medium, and high.
5. Data elements may need to be adjusted to make them comparable over time. For example,
currency values may need to be adjusted for inflation; they would need to be converted to the
same base year for comparability. They may need to be converted to a common currency.
6. Outlier data elements need to be removed after careful review, to avoid the skewing of
results. For example, one big donor could skew the analysis of alumni donors in an
educational setting.
7. Any biases in the selection of data should be corrected to ensure the data is
representative of the phenomena under analysis. If the data includes many more
members of one gender than is typical of the population of interest, then adjustments
need to be applied to the data.
8. Data should be brought to the same granularity to ensure comparability. Sales data
may be available daily, but the sales person compensation data may only be available
monthly. To relate these variables, the data must be brought to the lowest common
denominator, in this case, monthly.
9. Data may need to be selected to increase information density. Some data may not
show much variability, because it was not properly recorded or for any other reasons.
This data may dull the effects of other differences in the data and should be removed to
improve the information density of the data.
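As referenced in step 1, the sketch below illustrates a few of these steps (de-duplication, filling missing values with the average, and binning) using the pandas library; the column names and values are invented for illustration:

```python
# Sketch of a few cleansing steps from the list above, using pandas.
# Column names and values are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "age":         [34, 34, None, 51, 29],
    "experience":  [2, 2, 7, 15, 4],   # years
})

# Step 1: remove duplicate rows (same record received from two sources).
df = df.drop_duplicates()

# Step 2: fill missing values with the column average.
df["age"] = df["age"].fillna(df["age"].mean())

# Step 4: bin a continuous value into low / medium / high buckets.
df["experience_level"] = pd.cut(
    df["experience"], bins=[0, 5, 10, 100], labels=["low", "medium", "high"]
)

print(df)
```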

HOW TO DO DATA MINING
The accepted data mining process involves six steps:
1. Business understanding
The first step is establishing what the goals of the project are and how data mining can help you reach them. A plan should be
developed at this stage to include timelines, actions, and role assignments.
2. Data understanding
Data is collected from all applicable data sources in this step. Data visualization tools are often used in this stage to explore the
properties of the data to ensure it will help achieve the business goals.
3. Data preparation
Data is then cleansed, and missing data is included to ensure it is ready to be mined. Data processing can take enormous
amounts of time depending on the amount of data analyzed and the number of data sources. Therefore, distributed systems
are used in modern database management systems (DBMS) to improve the speed of the data mining process rather than
burden a single system. They’re also more secure than having all an organization’s data in a single data warehouse. It’s
important to include failsafe measures in the data manipulation stage so data is not permanently lost.
4. Data Modeling
Mathematical models are then used to find patterns in the data using sophisticated data tools.
5. Evaluation
The findings are evaluated and compared to business objectives to determine if they should be deployed across the
organization.
6. Deployment
In the final stage, the data mining findings are shared across everyday business operations. An enterprise business intelligence
platform can be used to provide a single source of truth for self-service data discovery.
BENEFITS OF DATA MINING

1. Automated Decision-Making
Data Mining allows organizations to continually analyze data and automate both routine and critical
decisions without the delay of human judgment. Banks can instantly detect fraudulent transactions,
request verification, and even secure personal information to protect customers against identity theft.
Deployed within a firm’s operational algorithms, these models can collect, analyze, and act on data
independently to streamline decision making and enhance the daily processes of an organization.

2. Accurate Prediction and Forecasting


Planning is a critical process within every organization. Data mining facilitates planning and provides
managers with reliable forecasts based on past trends and current conditions. Macy’s implements
demand forecasting models to predict the demand for each clothing category at each store and route
the appropriate inventory to efficiently meet the market’s needs.

3. Cost Reduction
Data mining allows for more efficient use and allocation of resources. Organizations can plan and
make automated decisions with accurate forecasts that will result in maximum cost
reduction. Delta embedded RFID chips in passengers' checked baggage and deployed data mining
models to identify holes in their process and reduce the number of bags mishandled. This process
improvement increases passenger satisfaction and decreases the cost of searching for and re-routing
lost baggage.

4. Customer Insights
Firms deploy data mining models from customer data to uncover key characteristics and differences
among their customers. Data mining can be used to create personas and personalize each touchpoint
to improve overall customer experience. In 2017, Disney invested over one billion dollars to create
and implement “Magic Bands.” These bands have a symbiotic relationship with consumers, working to
increase their overall experience at the resort while simultaneously collecting data on their activities
for Disney to analyze to further enhance their customer experience.
CHALLENGES OF DATA MINING - BIG DATA
Big Data
The challenges of big data are prolific and penetrate every field that collects, stores, and analyzes data. Big data
is characterized by four major challenges: volume, variety, veracity, and velocity. The goal of data mining is to
mitigate these challenges and unlock the data’s value.
• Volume describes the challenge of storing and processing the enormous quantity of data collected by
organizations. This enormous amount of data presents two major challenges: first, it is more difficult to find
the correct data, and second, it slows down the processing speed of data mining tools.
• Variety encompasses the many different types of data collected and stored. Data mining tools must be
equipped to simultaneously process a wide array of data formats. Failing to focus an analysis on both
structured and unstructured data inhibits the value added by data mining.
• Velocity details the increasing speed at which new data is created, collected, and stored. While volume refers
to increasing storage requirements and variety refers to the increasing types of data, velocity is the challenge
associated with the rapidly increasing rate of data generation.
• Finally, veracity acknowledges that not all data is equally accurate. Data can be messy, incomplete,
improperly collected, and even biased. As with anything, the quicker data is collected, the more errors will
manifest within the data. The challenge of veracity is to balance the quantity of data with its quality.
CHALLENGES OF DATA MINING - OVER-FITTING MODELS

Over-Fitting Models
Over-fitting occurs when a model explains the natural errors within the sample instead of the underlying trends of the population. Over-fitted models are often overly complex and utilize an excess of independent variables to generate a prediction. Therefore, the risk of over-fitting is heightened by the increase in volume and variety of data. Too few variables make the model irrelevant, whereas too many variables restrict the model to the known sample data. The challenge is to moderate the number of variables used in data mining models and balance their predictive power with accuracy.
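One way to see over-fitting in action is to compare training error with test error as model complexity grows. The sketch below does this on synthetic data with polynomial models of increasing degree; it is an illustration under invented data, not a prescribed method:

```python
# Sketch: over-fitting shown by comparing train vs. test error as the
# polynomial degree (model complexity) grows. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)  # noisy signal

idx = rng.permutation(x.size)          # random train/test split
train, test = idx[:20], idx[20:]

for degree in (1, 3, 12):
    coeffs = np.polyfit(x[train], y[train], degree)  # fit on training data only
    pred = np.polyval(coeffs, x)
    train_mse = np.mean((pred[train] - y[train]) ** 2)
    test_mse = np.mean((pred[test] - y[test]) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# A very high degree typically drives training error down while test error
# rises: the model has memorized the sample's noise, not the population trend.
```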

CHALLENGES OF DATA MINING - COST OF SCALE
Cost of Scale
As data velocity continues to increase data’s volume and variety, firms must scale these models and apply
them across the entire organization. Unlocking the full benefits of data mining with these models requires
significant investment in computing infrastructure and processing power. To reach scale, organizations must
purchase and maintain powerful computers, servers, and software designed to handle the firm’s large
quantity and variety of data.

19
CHALLENGES OF DATA MINING - PRIVACY AND SECURITY
Privacy and Security

The increased storage requirement of data has forced many firms to turn toward cloud computing and
storage. While the cloud has empowered many modern advances in data mining, the nature of the service
creates significant privacy and security threats. Organizations must protect their data from malicious
figures to maintain the trust of their partners and customers.

With data privacy comes the need for organizations to develop internal rules and constraints on the use
and implementation of a customer’s data. Data mining is a powerful tool that provides businesses with
compelling insights into their consumers. However, at what point do these insights infringe on an
individual’s privacy? Organizations must weigh this relationship with their customers, develop policies to
benefit consumers, and communicate these policies to the consumers to maintain a trustworthy
relationship.

TYPES OF DATA MINING
Data mining has two primary processes: supervised and unsupervised learning.

Supervised Learning
The goal of supervised learning is prediction or classification. The easiest way to conceptualize this process is
to look for a single output variable. A process is considered supervised learning if the goal of the model is to
predict the value of an observation. One example is spam filters, which use supervised learning to classify
incoming emails as unwanted content and automatically remove these messages from your inbox.
Common analytical models used in supervised data mining approaches are:

1. Linear Regressions
Linear regressions predict the value of a continuous variable using one or more independent inputs. Realtors
use linear regressions to predict the value of a house based on square footage, bed-to-bath ratio, year built,
and zip code.
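A minimal sketch of such a regression, using scikit-learn and an invented price/square-footage data set:

```python
# Sketch: predicting house price from square footage with a linear
# regression (scikit-learn). The tiny data set is invented.
from sklearn.linear_model import LinearRegression

square_feet = [[1100], [1500], [1800], [2400], [3000]]
price = [199_000, 245_000, 299_000, 369_000, 449_000]

model = LinearRegression().fit(square_feet, price)
print(model.predict([[2000]]))  # estimated price for a 2,000 sq ft house
```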

TYPES OF DATA MINING - SUPERVISED LEARNING
2. Logistic Regressions
Logistic regressions predict the probability of a categorical variable using one or more independent inputs.
Banks use logistic regressions to predict the probability that a loan applicant will default based on credit
score, household income, age, and other personal factors.
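A comparable sketch for a logistic regression, again with scikit-learn and an invented credit-score data set:

```python
# Sketch: predicting default probability from credit score with a
# logistic regression (scikit-learn). Data is invented.
from sklearn.linear_model import LogisticRegression

credit_score = [[520], [580], [620], [680], [710], [760], [800]]
defaulted    = [1,     1,     1,     0,     0,     0,     0]

model = LogisticRegression().fit(credit_score, defaulted)
print(model.predict_proba([[650]])[0, 1])  # probability of default
```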

3. Time Series
Time series models are forecasting tools which use time as the primary independent variable. Retailers,
such as Macy’s, deploy time series models to predict the demand for products as a function of time and use
the forecast to accurately plan and stock stores with the required level of inventory.
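A very simple illustration of time as the independent variable is a linear trend fit over monthly demand; the figures below are invented, and real retail forecasting models are far more elaborate:

```python
# Sketch: a naive time series forecast using a linear trend over time.
# The monthly demand figures are invented for illustration.
import numpy as np

demand = np.array([120, 132, 128, 141, 150, 149, 163, 170])  # 8 months
t = np.arange(len(demand))

slope, intercept = np.polyfit(t, demand, 1)   # fit demand = slope*t + intercept
next_month = slope * len(demand) + intercept  # forecast for month 9
print(f"forecast: {next_month:.0f} units")
```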

4. Classification or Regression Trees


Classification Trees are a predictive modeling technique that can be used to predict the value of both
categorical and continuous target variables. Based on the data, the model will create sets of binary rules to
split and group the highest proportion of similar target variables together. Following those rules, the group
that a new observation falls into will become its predicted value.
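A small sketch of a classification tree with scikit-learn, reusing the earlier blood-pressure/age example; the patient records are invented:

```python
# Sketch: a classification tree (scikit-learn) that learns binary split
# rules from labeled examples. Features and labels are invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [blood_pressure, age]; label: 1 = high risk, 0 = low risk
X = [[170, 70], [165, 68], [120, 40], [130, 55], [180, 72], [110, 30]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["blood_pressure", "age"]))
print(tree.predict([[160, 66]]))  # predicted class for a new patient
```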

5. Neural Networks
A neural network is an analytical model inspired by the structure of the brain, its neurons, and their connections. These models were originally created in the 1940s but have only recently gained popularity with statisticians and data scientists. Neural networks use inputs and, based on their magnitude, will “fire” or “not fire” a node based on its threshold requirement. This signal, or lack thereof, is then combined with the other “fired” signals in the hidden layers of the network, where the process repeats itself until an output is created. Since one of the benefits of neural networks is a near-instant output, self-driving cars deploy these models to accurately and efficiently process data and autonomously make critical decisions.
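The "fire or not fire" behavior of a single node can be sketched in a few lines; the weights, inputs, and threshold below are invented for illustration:

```python
# Sketch: a single artificial neuron "firing" when the weighted sum of
# its inputs reaches a threshold -- the basic unit that is stacked into
# the hidden layers described above. Values are invented.
import numpy as np

def neuron(inputs, weights, threshold):
    """Fire (1) if the weighted input sum reaches the threshold, else 0."""
    return int(np.dot(inputs, weights) >= threshold)

inputs  = np.array([0.9, 0.2, 0.7])
weights = np.array([0.5, 0.8, 0.3])
print(neuron(inputs, weights, threshold=0.8))  # 0.82 >= 0.8 -> the node fires
```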

6. K-Nearest Neighbor
The k-nearest neighbor method is used to categorize a new observation based on past observations. Unlike the previous methods, k-nearest neighbor is data-driven, not model-driven. This method makes no underlying assumptions about the data, nor does it employ complex processes to interpret its inputs. The basic idea of the k-nearest neighbor model is that it classifies a new observation by identifying its k closest neighbors and assigning it the majority's value. Many recommender systems use this method to identify and classify similar content, which is later pulled by the greater algorithm.
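A minimal k-nearest neighbor sketch with scikit-learn (k = 3), using invented two-dimensional points:

```python
# Sketch: classifying a new observation by majority vote of its k
# nearest neighbors (scikit-learn). Points and labels are invented.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[7, 7]]))  # -> ['B'], the majority among its 3 neighbors
```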

TYPES OF DATA MINING - UNSUPERVISED LEARNING
Unsupervised Learning
Unsupervised tasks focus on understanding and describing data to reveal underlying patterns within it.
Recommendation systems employ unsupervised learning to track user patterns and provide them with
personalized recommendations to enhance their customer experience.
Common analytical models used in unsupervised data mining approaches are:

1. Clustering
Clustering models group similar data together. They are best employed with complex data sets describing a
single entity. One example is look-alike modeling, which groups similarities between segments, identifies clusters, and targets new groups that look like an existing group.
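A short clustering sketch with scikit-learn; the customer features are invented, and k-means is just one of several clustering algorithms:

```python
# Sketch: grouping similar customers with k-means clustering
# (scikit-learn). The two-feature customer data is invented.
from sklearn.cluster import KMeans

# features: [annual_spend, visits_per_month]
customers = [[200, 1], [220, 2], [250, 1], [900, 8], [950, 9], [870, 7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two look-alike segments
```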

2. Association Analysis
Association analysis is also known as market basket analysis and is used to identify items that frequently
occur together. Supermarkets commonly use this tool to identify paired products and spread them out in the
store to encourage customers to pass by more merchandise and increase their purchases.
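A bare-bones version of this idea is simply counting how often item pairs appear in the same basket; production systems use algorithms such as Apriori, but the invented-data sketch below shows the core counting step:

```python
# Sketch: a bare-bones market basket analysis that counts how often
# item pairs occur together. The baskets are invented.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk", "cheese"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "cheese"},
    {"bread", "milk", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))

print(pair_counts.most_common(2))
# [(('bread', 'milk'), 3), ...] -> 'bread' and 'milk' pair most often
```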

3. Principal Component Analysis
Principal component analysis is used to illustrate hidden correlations between input variables and to create new variables, called principal components, which capture the same information contained in the original data but with fewer variables. By reducing the number of variables used to convey the same level of information, analysts can increase the utility and accuracy of supervised data mining models.
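A minimal PCA sketch with scikit-learn, reducing three correlated (invented) input variables to a single principal component:

```python
# Sketch: reducing correlated inputs to fewer principal components
# (scikit-learn). The small data matrix is invented.
import numpy as np
from sklearn.decomposition import PCA

# three correlated input variables, six observations
X = np.array([[2.0, 4.1, 1.0], [3.0, 6.2, 1.5], [4.0, 7.9, 2.1],
              [5.0, 10.1, 2.4], [6.0, 12.2, 3.0], [7.0, 13.8, 3.6]])

pca = PCA(n_components=1)
components = pca.fit_transform(X)     # one variable instead of three
print(pca.explained_variance_ratio_)  # share of information retained
```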

TOOLS AND PLATFORMS FOR DATA MINING
Data mining tools have existed for many decades. However, they have recently become more important as the value of data has grown and the field of big data analytics has come into prominence. There is a wide range of data mining platforms available in the market today.

• MS Excel is a relatively simple and easy data mining tool. It can become quite versatile once the Analysis ToolPak and other add-in products are installed on it.
• IBM’s SPSS Modeler is an industry-leading data mining platform. It offers a powerful set
of tools and algorithms for most popular data mining capabilities. It has colorful GUI
format with drag-and-drop capabilities. It can accept data in multiple formats, including
reading Excel files directly.
• Weka is an open-source GUI-based tool that offers a large number of data mining
algorithms.
• ERP systems include some data analytic capabilities, too. SAP has its Business Objects BI software. Business Objects is considered one of the leading BI suites in the industry and is often used by organizations that use SAP.
HOMEWORK

THANK YOU
Dr. Firas Yousef Omar

[email protected]

EXT: 9400
