3-Understanding Business Problems, Prediction Variables, Data Requirement-21-12-20

The document outlines the process of understanding business problems in predictive analytics, emphasizing the importance of domain knowledge and clear communication with stakeholders. It discusses defining prediction variables, data requirements, methodologies, and key metrics for model performance, using a case study on diamond prices to illustrate these concepts. The goal is to develop a predictive model that accurately estimates diamond prices to inform purchasing decisions.


MDI3003

Advanced Predictive Analytics


Dr. Uma Priya D
Assistant Professor Sr. Gr.-I
School of Computer Science and Engineering
Vellore Institute of Technology, Vellore
Module-II
Understanding Business problem

• Prediction Variable

• Data Requirement
• Access to Data

• Solution Method
• Key Metrics - Model Performance

• Diamond prices – Case Study


• Data Collection - Preparation
• Numerical features
• Encoding Categorical Features
• Low Variance Features
• Near Collinearity One-hot Encoding.
• Introduction to Problem Understanding
• Understand the Problem

• Importance of Domain Knowledge

• Translating Business needs

• Context is Key

• Define what is going to be predicted

Introduction to Problem Understanding and Proposing a Solution
• First Stage: Establish goals with stakeholders.
• Key Questions:
• What problem needs to be solved?
• How does the solution look from a business perspective?

• Predictive analytics works within a specific domain


• The more you understand that domain, the better you will be able to
understand its problems and propose good solutions – Domain Knowledge
matters!
Why Domain Knowledge Matters?

• Technical skills (coding, mathematical proofs, etc.) are less critical in business interactions.

• What matters more?

• Understanding business-specific problems.

• Proposing solutions with measurable impacts such as cost reduction, time savings, and increased customer retention.

Translating Business Needs

• Business needs are often vague or general. Examples:


• "Avoid client churn."

• "Understand the impact of competition."

• "Explain rising default rates."

• Solution:
• Translate vague needs into clear analytical problems.

• Gather context and domain-specific details.


Context is Key

• Even without expertise, understand enough to propose valuable solutions.

• Learn the vocabulary and business dynamics.

• Understand key metrics (e.g., default rate definitions, churn dynamics).

• Solutions must deliver measurable value to the business (e.g., increased retention, reduced costs, time saved).

Define what is going to be predicted

• When working in predictive analytics, it is your job to clarify and make
the requirements explicit in terms of the outputs of the model:
• What do the outputs look like?

• In other words, what is being predicted, and what is the target that will solve
the business problem?

Define what is going to be predicted - Example

• Suppose we are asked to build a model for predicting the churn of clients in a
telecommunications company; the target, in this case, could be a categorical variable
with two categories: "churners" (clients who will leave the company) versus
"non-churners" (clients who will stay).

• However, based on your domain knowledge, you know that there are in fact two types of
churners: "voluntary churners" and "involuntary churners." So, which target is better: the
first one with two categories (churners versus non-churners) or the second one with
three (voluntary churners, involuntary churners, and non-churners)?

• The answer, of course, depends on the business goals for the model, and it will be your
task to recommend or decide which target is better.
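
The two candidate targets can be made concrete with a small sketch. The customer records, the `status` field, and its values are hypothetical; they only illustrate how the same raw data can yield either a two-class or a three-class target.

```python
# Hypothetical customer records; "status" and its values are illustrative assumptions.
customers = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "left_voluntarily"},
    {"id": 3, "status": "left_involuntarily"},
    {"id": 4, "status": "active"},
]

# Three-class target: keeps the distinction between the two kinds of churn.
THREE_CLASS = {
    "active": "non-churner",
    "left_voluntarily": "voluntary churner",
    "left_involuntarily": "involuntary churner",
}


def three_class_target(status):
    """Map a raw status to one of three target categories."""
    return THREE_CLASS[status]


def binary_target(status):
    """Collapse both kinds of churn into a single 'churner' category."""
    return "non-churner" if status == "active" else "churner"


binary = [binary_target(c["status"]) for c in customers]
three = [three_class_target(c["status"]) for c in customers]
print(binary)
print(three)
```

Which of the two lists becomes the training target is a business decision, not a technical one: the three-class version only pays off if the business will act differently on voluntary and involuntary churners.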
• Data Requirement
• Access to Data

Define Data Requirements

• Once the output of the model has been defined, you should make explicit which data will
be required to solve the problem and produce the predictions you intend:
• which data sources you need to be able to access,
• in what format the data is needed,
• how much data is needed, and so on.

• Reality of Data Availability:


• Ideal data may not be accessible.
• You may have to work with what is already available (e.g., only 6 months of data instead of 12).

• Solution: Discuss with the key stakeholders which data they consider relevant
from a business perspective.
Data Requirements – Access to Data

• How are you going to get the dataset?


• Maybe you already have access to the company database, or maybe you need to get permission to access it.
• Another possibility: ask the people in charge (database administrators or data engineers) to provide you with
the data you need; in that case, you need to be really clear in your communication about what you actually
need, so make sure you understand the data's particularities.

Key considerations to keep in mind when asking for a dataset:

• The format in which you expect the dataset

• If given in a table format, the type of each of the dataset columns

• How the missing values will be encoded

• In which encoding the files will be provided

• In the case of historical data, how long a time period the data should cover
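
Most of these considerations map directly onto options used when loading the file. The sketch below assumes a CSV extract with made-up column names and assumes that missing values were agreed to be written as "?"; only the `pandas.read_csv` parameters themselves are real.

```python
import io

import pandas as pd

# Hypothetical extract as it might be delivered: UTF-8 encoded CSV in which
# missing values are written as "?" (an assumption agreed on with the provider).
raw = io.BytesIO(
    "client_id,signup_date,plan,monthly_fee\n"
    "101,2023-01-15,basic,9.99\n"
    "102,2023-02-01,?,19.99\n"
    "103,2023-02-20,premium,?\n".encode("utf-8")
)

df = pd.read_csv(
    raw,
    encoding="utf-8",                 # file encoding
    na_values=["?"],                  # how missing values are encoded
    dtype={                           # expected type of each column
        "client_id": "int64",
        "plan": "string",
        "monthly_fee": "float64",
    },
    parse_dates=["signup_date"],      # historical data: parse the time column
)

print(df.dtypes)
print(df.isna().sum())
```

Stating the expected types up front makes the load fail early if the delivered file does not match what was requested.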


• Proposing a Solution
• Define your methodology

• Define key metrics of model performance

• Define the deliverables of the project

Define the Methodology

• State the Methodology Ahead of Time


• Clearly outline the approach you plan to use for solving the problem.
• Adjust the level of detail based on the audience's technical expertise.

Levels of Methodology Description

• High-Level Explanation:
• Examples: “A classification model will be trained.” “A forecasting model will be developed to estimate monthly
sales.”
• Suitable for non-technical stakeholders who are interested in the result, not the process.

• Detailed Explanation:
• Example: “We will use principal component analysis (PCA) to reduce dimensionality, followed by logistic
regression to classify customers into churn or non-churn categories.”
• Necessary for technical stakeholders, such as data science teams or IT departments who need specifics.
Define the Methodology (Contd…)

• Always clarify why the proposed methodology is the best choice.


• Example 1: “This method is effective because it handles imbalanced data
well.”
• Example 2: “We chose Random Forest because it handles missing data well
and is robust to outliers, which are present in our dataset.”
• Example 3: “A seasonal time series model such as SARIMA was selected as it explicitly
models seasonality, which is a key feature in our sales data.”
• Align the explanation with business needs rather than focusing solely on
technical merits.

• Avoid Technical Jargon for Non-Technical Stakeholders


• Key Principle: Do not use this opportunity to show off technical expertise.

• Communicate solutions simply and clearly, focusing on their value to the business.

• Example-1: Instead of saying, “We’ll train a CNN to detect patterns in the image
data,” say: “We’ll build a system to automatically identify defects in product images
to improve quality control.”

• Example-2: Instead of saying, “We’ll cluster data using k-means,” say: “We’ll group
customers with similar behaviors to create targeted marketing strategies.”

• Know Your Audience
• Tailor your explanation to the audience
• Technical details for technical audiences.
• High-level overviews for business stakeholders.
• Avoid alienating stakeholders with excessive complexity.

• Practical Example of Miscommunication to Avoid:


• Bad Practice: Overloading a presentation with complex neural network diagrams for
non-technical executives.
• Good Practice: Explaining how the model improves efficiency or saves costs, without
delving into algorithmic details unless asked.
Define the Key Metrics
• Key metrics are the statistical measures that evaluate the effectiveness of machine learning and
statistical models.
• These metrics can help compare different models and select the best one for a specific task or data set.
• Mean Absolute Error (MAE): The average of the absolute differences between the predicted and
actual values. Every error contributes in proportion to its size, which makes MAE easy to interpret.
• Mean Squared Error (MSE): The average of the squared differences between the predicted and actual
values. Squaring penalizes large errors more heavily; MSE is a common choice for regression tasks.
• Root Mean Squared Error (RMSE): The square root of the MSE. RMSE is useful because it
expresses the error in the same units as the target variable. A perfect model has an RMSE of 0;
lower values are better.
• Precision: The fraction of predicted positives that are truly positive, TP / (TP + FP). It is the
metric to watch when a model needs to minimize false positives.
• Other metrics for model performance include:
• Classification accuracy, Logarithmic loss, Area under Curve, F1 score, Recall, and Confusion Matrix.
• The choice of metric is important because it directly impacts how the performance is calculated and compared.
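
As a minimal sketch of how the classification metrics named above relate to each other, the function below derives them from the four confusion-matrix counts. The labels are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not a rule from the slides.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Convention chosen for this sketch: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative labels: 1 = churner, 0 = non-churner.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall pull in different directions: flagging fewer clients as churners tends to raise precision and lower recall, which is why the metric has to be chosen with the business cost of each error type in mind.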
Define the deliverables of the project
• Project deliverables are the products, services, or outcomes that are required to
complete a project.
• How will the model outputs be used? This can be in many ways: through a dedicated application, via an
API, as a module of an existing application, and so on.

• Besides the predictions of the model, what other outputs are needed?

• Are you going to be required to write a report about the results of your model and analysis?

• Will you be required to deliver a presentation?

• Deliverables can be anything: a new product, a marketing campaign, a feature
update, a sales deck, a decrease in churn, an increase in NPS, and so on.
• Diamond prices – Case Study
• Problem understanding and definition

• Proposing a solution at a high level


• Goal

• Methodology

• Metrics

• Deliverables

• Data collection and preparation


• Dealing with missing values

• Diamond prices – Problem understanding and definition
A new company, Intelligent Diamond Reseller (IDR), wants to get into the business of reselling diamonds.
They want to innovate in the business, so they will use predictive modeling to estimate how much the
market will pay for diamonds. Of course, to sell diamonds in the market, first they have to buy them from
the producers; this is where predictive modeling becomes useful. Let's say people at IDR know ahead of
time that they will be able to sell a specific diamond in the market for USD 5,000. With that information,
they know how much to pay when buying this diamond. If someone tries to sell that diamond to them for
USD 2,750, then that would be a very good deal; likewise, it would be a bad deal to pay USD 6,000 for
such a diamond. So, as you can see, for IDR it would be very important to be able to predict the price the
market will pay for diamonds accurately. They have been able to get a dataset (this is actually real-world
data) containing the prices and key characteristics of about 54,000 diamonds; here we have the metadata
about the dataset:

• Number of attributes: 10
Feature information: A DataFrame with 53,940 rows and 10 variables:

• price: Price in US dollars

• carat: Weight of the diamond

• cut: Quality of the cut (fair, good, very good, premium, ideal)

• color: Diamond color, from J (worst) to D (best)

• clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

• x: Length in mm

• y: Width in mm

• z: Depth in mm

• depth: Total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y)

• table: Width of the top of the diamond relative to the widest point
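
The depth definition can be checked with a short worked example. The stone's dimensions are hypothetical, and the factor of 100 reflects the assumption that the column is stored as a percentage rather than as a ratio.

```python
def total_depth_percentage(x, y, z):
    """Depth z relative to the mean of length x and width y, as a percentage."""
    return 100 * 2 * z / (x + y)


# Hypothetical stone: 4.0 mm long, 4.0 mm wide, 2.46 mm deep.
depth = total_depth_percentage(4.0, 4.0, 2.46)
print(round(depth, 1))  # 61.5
```

Because depth is computed from x, y, and z, it carries little information beyond those three columns, which is worth remembering when collinearity between features is examined later.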
Sample Dataset

Getting more context

• The most important factor in the price of a diamond is its carat (weight). Along with carat,
color, clarity, and cut also play an important role in the price. This is good news, since all of
these features are contained in our dataset.

• Another key characteristic of diamonds is certification. There is no information about
certification in the dataset, which is potentially problematic since our research shows that
people are willing to pay much less for a diamond that is not certified. This is one of the key
questions that you will have to ask the IDR people.

• After talking with them, they inform you that they will only deal with certified diamonds and that the
dataset you will work with contains only certified diamonds.

• Diamond prices – Proposing a solution at a high level
Goal

• To use the features contained in the dataset (all columns except for the price)

• To build a predictive model that predicts the price of diamonds, as accurately as possible, based
on those features

• To predict the prices of diamonds offered to IDR by the producers, so IDR can decide how much to
pay for those diamonds

Methodology

• Target: price

• Features: carat, cut, color, clarity, x, y, z, depth, and table (the remaining columns in the table).

• Since we are talking about prices, the variable we want to predict is continuous:
it can take (in principle) any numeric value within a range.

• In predictive analytics, when the target is a numerical variable, the problem belongs to the
category known as regression tasks. We are therefore solving a regression problem.

• Methodology: Building a regression model with the price of the diamond as a target
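
To make "regression with price as the target" concrete, here is a minimal sketch that fits a straight line of price against a single feature (carat) by ordinary least squares. The five data points are invented for illustration and are not taken from the dataset; a real model would use all nine features and would be chosen and validated later in the project.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy data: carat (feature) and price in USD (target).
carat = [0.3, 0.5, 0.7, 1.0, 1.5]
price = [500, 1500, 2500, 5000, 10000]

slope, intercept = fit_line(carat, price)


def predict_price(c):
    """Predicted price for a diamond of c carats under the fitted line."""
    return intercept + slope * c


residuals = [p - predict_price(c) for c, p in zip(carat, price)]
print(slope, intercept)
print(predict_price(0.8))
```

The output is a number on a continuous scale rather than a category, which is exactly what distinguishes this task from the churn classification example.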
Metrics for the model

• The logic behind almost all of the standard metrics is very straightforward:
• If the predictions are close to the actual (real) values then that is
considered good
• Conversely, if the prediction is far away from the real value, then that is
not good

• Some common metrics used to evaluate the performance of regression models include:

• Mean Absolute Error (MAE): Quantifies the average magnitude of errors in a set of predictions, without considering
their direction.

• Mean Squared Error (MSE): The average of the squared differences between the actual and predicted values. A
lower value indicates a better regression model.

• Root Mean Squared Error (RMSE): The square root of the MSE. Like MAE it is expressed in the units of the target,
but because errors are squared before averaging, large errors weigh more heavily.

• Mean Absolute Percentage Error (MAPE): The average of the absolute errors expressed as a percentage of the actual
values. It is independent of the scale of the target but undefined when an actual value is zero.

• Coefficient of Determination or R-squared: Represents the proportion of the variance in the dependent variable
that is explained by the regression model. A higher R-squared indicates that the regression captures more of the
variation in the observed dependent variable.
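
The regression metrics listed above can be sketched in a few lines of plain Python. The actual and predicted prices are made-up values chosen so that the arithmetic is easy to follow by hand.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, MAPE (in percent) and R-squared."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual values, so none of them may be zero.
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape, "r2": r2}


# Made-up prices in USD.
actual = [1000, 2000, 4000, 5000]
predicted = [1100, 1800, 4400, 5000]
scores = regression_metrics(actual, predicted)
print(scores)
```

Here the errors are -100, 200, -400 and 0, so the MAE is 175 while the RMSE is about 229: the single 400-dollar miss pulls the RMSE up more than the MAE.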

Deliverables for the project

• The people from IDR have stated that they would like a software tool where they can input
the different features of the diamond and based on that, the tool gives back a prediction for
the price of the diamond. That is their only concern; they care only about the price of the
diamond.

• You agree with their request and you propose that the solution will be a simple web
application that will contain a form where they will be able to input the features of a
diamond, and the application will give a prediction of the price based on the model that will
be built using the available dataset.
Understanding Business problem and Data Preparation
• Diamond prices – Data collection and preparation (Contd…)
• The data collection process varies significantly depending on the nature of the project.

• Sources of Data:

• Internal Databases: Access organizational databases (e.g., CRM systems, financial records).

• ETL Processes: Extract, transform, and load data from raw sources into usable formats.

• External Data Sources:

• Subscription-based services (e.g., Bloomberg, Quandl).

• Public APIs for open data (e.g., government data portals, social media APIs).

Data Cleaning in Predictive Analytics
• Data cleaning often takes the majority of time in a predictive analytics project.
• No standard procedure; each dataset has unique challenges.
• Identify and address corrupt, incomplete, useless, or incorrect data to improve model
performance and reliability.
Common Tasks in Data Cleaning
1. Handling Missing Data
2. Dealing with Duplicates
3. Addressing Outliers
4. Correcting Incorrect Data
5. Regular Expression (Regex) for Cleaning Text Data

1. Handling Missing Data:

• Missing values can occur due to various reasons, such as data entry errors or incomplete
records.

• Approaches:
• Remove Rows/Columns: Drop rows/columns with too many missing values.

df.dropna(axis=0, inplace=True) # Drop rows with missing values

• Imputation: Fill missing values with mean, median, mode, or a placeholder.

df['column_name'] = df['column_name'].fillna(df['column_name'].mean()) # Fill with mean (assign back; in-place fillna on a selected column is deprecated)
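
Before choosing between dropping and imputing, it helps to count how much is missing in each column. The sketch below uses a small invented DataFrame and shows median imputation as one alternative to the mean; which columns to drop or fill is a judgment call that depends on the dataset.

```python
import numpy as np
import pandas as pd

# Invented rows; NaN marks a missing measurement.
df = pd.DataFrame({
    "carat": [0.23, 0.31, np.nan, 0.90, 1.20],
    "depth": [61.5, np.nan, 62.1, np.nan, 60.8],
    "price": [326, 335, 400, 2800, 5000],
})

# Step 1: count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Step 2a: drop rows only where a chosen key column is missing.
df_rows_dropped = df.dropna(subset=["carat"])

# Step 2b: fill a column with its median, which is less sensitive to
# extreme values than the mean.
df_imputed = df.copy()
df_imputed["depth"] = df_imputed["depth"].fillna(df_imputed["depth"].median())
print(df_imputed)
```

Working on a copy keeps the original data available for comparing the effect of each strategy.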


2. Dealing with Duplicates:

• Duplicate rows can skew analysis.

• Approach:
• Remove duplicates

df.drop_duplicates(inplace=True)
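
It is usually worth counting duplicates before removing them, and deciding whether "duplicate" means identical in every column or only in some. The rows below are invented; `duplicated` and `drop_duplicates` are standard pandas methods.

```python
import pandas as pd

# Invented rows: row 1 repeats row 0 exactly; rows 2 and 3 differ only in price.
df = pd.DataFrame({
    "carat": [0.23, 0.23, 0.31, 0.31],
    "cut": ["Ideal", "Ideal", "Good", "Good"],
    "price": [326, 326, 335, 340],
})

# Exact duplicates: every column matches an earlier row.
n_exact = int(df.duplicated().sum())

# Partial duplicates: only the listed columns have to match.
n_partial = int(df.duplicated(subset=["carat", "cut"]).sum())

# Remove exact duplicates, keeping the first occurrence of each row.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_exact, n_partial, len(deduped))
```

In a dataset like this one, two rows with identical features may still be two different diamonds, so removing duplicates should be confirmed with whoever supplied the data.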


3. Addressing Outliers:

• Outliers can distort model results.

• Approach:
• Identify and cap or remove extreme values.

import numpy as np

upper_limit = np.percentile(df['column_name'], 95) # 95th percentile

df['column_name'] = np.clip(df['column_name'], None, upper_limit) # Cap values
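
The percentile cap above handles only the upper tail. A common alternative, swapped in here for illustration and not taken from the slides, is the interquartile range (IQR) rule, which flags values more than 1.5 IQRs outside the quartiles on either side. The prices are invented.

```python
import pandas as pd

# Invented prices with one extreme value.
prices = pd.Series([300, 320, 350, 400, 420, 450, 500, 18000], name="price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences of the IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (prices < lower) | (prices > upper)

# Option 1: cap extreme values at the fences.
capped = prices.clip(lower=lower, upper=upper)

# Option 2: remove the flagged rows.
filtered = prices[~is_outlier]

print(lower, upper)
print(int(is_outlier.sum()))
```

Whether to cap or remove depends on whether the extreme values are errors or genuine, if rare, observations; an expensive diamond is not necessarily a mistake.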


4. Correcting Incorrect Data:

• Identify entries that don't make sense (e.g., negative values for age).

• Approach:
• Replace or drop incorrect values.

import numpy as np

df.loc[df['age'] < 0, 'age'] = np.nan # Replace negative ages with NaN
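
The same idea carries over to the diamond features: a length, width, or depth of zero or less cannot be a real measurement. The rows below are invented to show the check; they are not taken from the dataset, and treating such values as missing rather than dropping the rows is an assumption of this sketch.

```python
import pandas as pd

# Invented rows with two physically impossible dimensions.
df = pd.DataFrame({
    "x": [3.95, 0.0, 4.05],
    "y": [3.98, 4.10, 4.07],
    "z": [2.43, 2.50, -1.0],
    "price": [326, 500, 327],
})

dims = ["x", "y", "z"]

# True wherever a dimension is zero or negative.
invalid = df[dims] <= 0
n_invalid_rows = int(invalid.any(axis=1).sum())

# Replace impossible values with NaN so they can be handled as missing data.
df[dims] = df[dims].mask(invalid)

print(n_invalid_rows)
print(df)
```

Once the impossible values are marked as missing, the approaches for handling missing data apply to them as well.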


5. Regular Expression (Regex) for Cleaning Text Data:

• Used to identify patterns and clean text data.

• Approach:
• Remove special characters, standardize formats, or extract meaningful text.

# Series.str.replace applies the regular expression itself, so no separate import of re is needed

df['cleaned_column'] = df['text_column'].str.replace(r'[^\w\s]', '', regex=True) # Remove special characters
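
Besides removing characters, regular expressions are useful for extracting a value from inconsistently formatted text. The sketch below uses Python's `re` module on invented strings; the "ct" suffix and the comma as decimal separator are assumptions about how such a field might be typed in.

```python
import re

# Invented free-text weight entries.
raw_values = ["0.23 ct", " 1.5ct ", "0,75 CT", "n/a"]

# A number with an optional decimal part (dot or comma), followed by "ct".
CARAT_PATTERN = re.compile(r"(\d+(?:[.,]\d+)?)\s*ct", flags=re.IGNORECASE)


def parse_carat(text):
    """Return the carat weight as a float, or None if no weight is found."""
    match = CARAT_PATTERN.search(text)
    if match is None:
        return None
    # Standardize the decimal separator before converting.
    return float(match.group(1).replace(",", "."))


parsed = [parse_carat(v) for v in raw_values]
print(parsed)  # [0.23, 1.5, 0.75, None]
```

Returning None for unparseable entries keeps them visible as missing values instead of silently turning them into a wrong number.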


Libraries Commonly Used in Python for Data Cleaning:

• Pandas: For handling missing values, duplicates, and general data manipulation.

• NumPy: For mathematical operations and handling outliers.

• Regex (re module): For pattern matching and text cleaning.

• Scikit-learn: For preprocessing and scaling.

• Diamond prices – Case Study
• Data Collection - Preparation

• Numerical features

• Encoding Categorical Features

• Low Variance Features

• Near Collinearity One-hot Encoding.

Given in Jupyter Notebook
