3-Understanding Business Problems, Prediction Variables, Data Requirement-21-12-20
• Prediction Variable
• Data Requirement
• Access to Data
• Solution Method
• Key Metrics - Model Performance
• Context is Key
• Solution:
• Translate vague needs into clear analytical problems.
• In other words, what is being predicted, and what is the target that will solve
the business problem?
• Suppose we are asked to build a model for predicting the churn of clients in a
telecommunications company; the target, in this case, could be a categorical variable
with two categories: "churners" (clients who will leave the company) versus
"non-churners" (clients who will stay).
• However, based on your domain knowledge, you know that in fact there are two types of
churners: "voluntary churners" and "involuntary churners." So, which target is better?
• The first one with two categories (churners versus non-churners) or the second one with
three (voluntary churners, involuntary churners, and non-churners)?
• The answer, of course, depends on the business goals for the model, and it will be your
task to recommend or decide which target is better.
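The two candidate targets can be sketched as two label mappings over the same client records. This is only an illustration: the client records and the "status" values below are hypothetical and are not taken from any real telecom dataset.

```python
# Hypothetical client records; the "status" values are illustrative assumptions.
clients = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "left_voluntarily"},
    {"id": 3, "status": "left_involuntarily"},
    {"id": 4, "status": "active"},
]

# Option 1: binary target (churner vs. non-churner).
BINARY_TARGET = {
    "active": "non-churner",
    "left_voluntarily": "churner",
    "left_involuntarily": "churner",
}

# Option 2: three-class target that keeps the two churn types apart.
THREE_CLASS_TARGET = {
    "active": "non-churner",
    "left_voluntarily": "voluntary churner",
    "left_involuntarily": "involuntary churner",
}

binary_labels = [BINARY_TARGET[c["status"]] for c in clients]
three_class_labels = [THREE_CLASS_TARGET[c["status"]] for c in clients]

print(binary_labels)
print(three_class_labels)
```

The raw data is the same in both cases; only the definition of the target changes, which is why the choice has to follow from the business goal.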
Dr. Uma Priya D 9
Understanding Business problem
• Data Requirement
• Access to Data
• Once the output of the model has been defined, you should make explicit which data will
be required to solve the problem and produce the predictions you intend:
• which data sources you need to be able to access,
• in what format the data is needed,
• how much data is needed, and so on.
• Solution: Discuss with the key stakeholders which data they consider relevant
from a business perspective.
Data Requirements – Access to Data
• High-Level Explanation:
• Examples: “A classification model will be trained”, “A forecasting model will be developed to estimate monthly
sales”
• Suitable for non-technical stakeholders who are interested in the result, not the process.
• Detailed Explanation:
• Example: “We will use principal component analysis (PCA) to reduce dimensionality, followed by logistic
regression to classify customers into churn or non-churn categories.”
• Necessary for technical stakeholders, such as data science teams or IT departments who need specifics.
Define the Methodology (Contd…)
• Communicate solutions simply and clearly, focusing on their value to the business.
• Example-1: Instead of saying, “We’ll train a CNN to detect patterns in the image
data,” say: “We’ll build a system to automatically identify defects in product images
to improve quality control.”
• Example-2: Instead of saying, “We’ll cluster data using k-means,” say: “We’ll group
customers with similar behaviors to create targeted marketing strategies.”
• Besides the predictions of the model, what other outputs are needed?
• Are you going to be required to write a report about the results of your model and analysis?
• Methodology
• Metrics
• Deliverables
• Number of attributes: 10
• Diamond prices – Problem understanding and definition (Contd…)
Feature information: A DataFrame with 53,940 rows and 10 variables:
• cut: Quality of the cut (fair, good, very good, premium, ideal)
• clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
• x: Length in mm
• y: Width in mm
• z: Depth in mm
• table: Width of the top of the diamond relative to the widest point
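Because cut and clarity are ordered categories, one possible way to prepare them for a model is to map each level to its rank. The sketch below copies the level orders from the descriptions above; the helper name `ordinal_codes` and the example diamond are hypothetical, and ordinal encoding is only one of several encoding choices.

```python
# Level orders copied from the feature descriptions, worst to best.
CUT_ORDER = ["fair", "good", "very good", "premium", "ideal"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def ordinal_codes(levels):
    """Map each level to its rank, starting at 0 for the worst level."""
    return {level: rank for rank, level in enumerate(levels)}


cut_codes = ordinal_codes(CUT_ORDER)
clarity_codes = ordinal_codes(CLARITY_ORDER)

# Hypothetical diamond, used only to show the lookup.
diamond = {"cut": "premium", "clarity": "VS1"}
encoded = {
    "cut": cut_codes[diamond["cut"]],
    "clarity": clarity_codes[diamond["clarity"]],
}
print(encoded)  # {'cut': 3, 'clarity': 4}
```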
• Diamond prices – Problem understanding and definition (Contd…)
Sample Dataset
• The most important factor in the price of a diamond is its carat, or weight. Along
with carat, the characteristics that play an important role in the price of
diamonds are color, clarity, and cut. This is good news, since all of these features
are contained in our dataset.
• Another key characteristic of diamonds is the certification process. The dataset contains no
information about certification, which is potentially problematic since our research shows that
people are willing to pay much less for a diamond that is not certified. This is one of the key
questions that you will have to ask the people at IDR.
• When you talk with them, they inform you that they will only deal with certified diamonds and that
the dataset you will work with contains only certified diamonds.
• To use the features contained in the dataset (all columns except for the price)
• To build a predictive model that predicts the price of diamonds, as accurately as possible, based
on those features
• To predict the prices of diamonds offered to IDR by the producers, so IDR can decide how much to
pay for those diamonds
• Target: price
• Features: carat, cut, color, clarity, x, y, z, depth, and table (the remaining columns in the table).
• Since we are talking about prices, the type of variable we want to predict is a continuous variable;
it can take (in principle) any numeric value within a range.
• Since we are predicting a continuous variable, we are trying to solve a regression problem.
• In predictive analytics, when the target is a numerical variable, we are within a category of
problems known as regression tasks.
• Methodology: Building a regression model with the price of the diamond as a target
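The split into target and features described above can be sketched with pandas. The three rows below are made-up values that only reuse the dataset's column names; they are not real records, and the variable names `X` and `y` are a common convention rather than part of the original material.

```python
import pandas as pd

# Three made-up rows with the dataset's column names; values are illustrative only.
diamonds = pd.DataFrame({
    "carat": [0.23, 0.31, 1.01],
    "cut": ["ideal", "good", "premium"],
    "color": ["E", "J", "G"],
    "clarity": ["SI2", "VS1", "VVS2"],
    "depth": [61.5, 63.3, 62.0],
    "table": [55.0, 58.0, 57.0],
    "price": [326, 335, 6500],
    "x": [3.95, 4.34, 6.40],
    "y": [3.98, 4.35, 6.45],
    "z": [2.43, 2.75, 3.98],
})

TARGET = "price"
X = diamonds.drop(columns=[TARGET])  # features: every column except the price
y = diamonds[TARGET]                 # numeric target, so this is a regression task

print(list(X.columns))
print(y.dtype)
```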
• Diamond prices – proposing a solution at a high level (Contd…)
Metrics for the model
• The logic behind almost all of the standard metrics is very straightforward:
• If the predictions are close to the actual (real) values then that is
considered good
• Conversely, if the prediction is far away from the real value, then that is
not good
• Mean Absolute Error (MAE): The average magnitude of the errors in a set of predictions, without considering
their direction.
• Root Mean Squared Error (RMSE): The square root of the average squared error. Like MAE, it is expressed in the
units of the target, but squaring penalizes large errors more heavily.
• Mean Squared Error (MSE): The average of the squared differences between the actual and predicted values. A
lower value indicates a better regression model.
• Mean Absolute Percentage Error (MAPE): The average absolute error expressed as a percentage of the actual
value. It is undefined when an actual value is zero.
• Coefficient of Determination or R-squared: The proportion of the variance in the dependent variable
that is explained by the regression model. A higher R-squared indicates that the model captures more of the
variation in the observed values of the dependent variable.
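To make these metrics concrete, the following is a minimal sketch that computes all five in plain Python from their standard definitions. The function name `regression_metrics` and the four price pairs are hypothetical; in practice a library implementation would normally be used instead.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, MAPE (in percent) and R-squared for two equal-length lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual values, so it fails if any actual value is zero.
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Made-up actual and predicted prices, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 380.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

For these numbers the MAE is 12.5 and the MSE is 175, so the RMSE (about 13.2) is larger than the MAE because the single error of 20 is weighted more heavily after squaring.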
• The people from IDR have stated that they would like a software tool where they can input
the different features of the diamond and based on that, the tool gives back a prediction for
the price of the diamond. That is their only concern; they care only about the price of the
diamond.
• You agree with their request and you propose that the solution will be a simple web
application that will contain a form where they will be able to input the features of a
diamond, and the application will give a prediction of the price based on the model that will
be built using the available dataset.
Understanding Business problem and Data Preparation
• Diamond prices – Data collection and preparation (Contd…)
• The data collection process varies significantly depending on the nature of the project.
• Sources of Data:
• Internal Databases: Access organizational databases (e.g., CRM systems, financial records).
• ETL Processes: Extract, transform, and load data from raw sources into usable formats.
• Public APIs: Access open data (e.g., government data portals, social media APIs).
• Missing values can occur due to various reasons, such as data entry errors or incomplete
records.
• Approaches:
• Remove Rows/Columns: Drop rows/columns with too many missing values.
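A minimal sketch of the drop-by-threshold approach with pandas is shown below. The 50% cut-off (`MAX_MISSING_SHARE`) and the toy DataFrame are assumptions made for illustration; what counts as "too many" missing values depends on the project.

```python
import numpy as np
import pandas as pd

# Toy data with gaps, for illustration only.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, np.nan],
    "b": [np.nan, np.nan, np.nan, 1.0, 5.0],
    "c": [1.0, np.nan, 3.0, 4.0, np.nan],
})

MAX_MISSING_SHARE = 0.5  # assumed cut-off for "too many" missing values

# Step 1: drop columns whose share of missing values exceeds the cut-off.
col_missing = df.isna().mean()
kept_columns = col_missing[col_missing <= MAX_MISSING_SHARE].index
df_cols = df[kept_columns]

# Step 2: among the remaining columns, drop rows that are mostly empty.
row_missing = df_cols.isna().mean(axis=1)
cleaned = df_cols[row_missing <= MAX_MISSING_SHARE]

print(cleaned)
```

Columns are filtered first so that a mostly empty column does not cause otherwise usable rows to be discarded.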
• Approach:
• Remove duplicates
df.drop_duplicates(inplace=True)
3. Addressing Outliers:
• Approach:
• Identify and cap or remove extreme values.
import numpy as np
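One common way to identify extreme values is the interquartile range (IQR) rule. The sketch below assumes that rule with the conventional 1.5 multiplier, which the original material does not specify, and shows both capping and removal on made-up numbers.

```python
import numpy as np

# Made-up measurements with one obvious extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# IQR rule (an assumed choice): flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option A: cap extreme values at the bounds.
capped = np.clip(values, lower, upper)

# Option B: remove extreme values entirely.
removed = values[(values >= lower) & (values <= upper)]

print(capped)
print(removed)
```

Capping keeps the number of rows unchanged, while removal shrinks the dataset; which one is appropriate depends on whether the extreme values are errors or genuine observations.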
• Identify entries that don't make sense (e.g., negative values for age).
• Approach:
• Replace or drop incorrect values.
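For the negative-age example, a minimal pandas sketch of both options (replace or drop) could look as follows. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical records; a negative age cannot be correct.
people = pd.DataFrame({"name": ["Ana", "Ben", "Cy"], "age": [34, -5, 51]})

invalid = people["age"] < 0

# Option A: replace invalid ages with a missing value (NaN) for later imputation.
replaced = people.copy()
replaced["age"] = replaced["age"].where(~invalid)

# Option B: drop the rows that contain invalid ages.
dropped = people[~invalid]

print(replaced)
print(dropped)
```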
• Approach:
• Remove special characters, standardize formats, or extract meaningful text.
import re
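The text-cleaning steps can be sketched with the standard `re` module. The helper names `clean_text` and `extract_first_number` and the sample strings are hypothetical; the exact rules (which characters to keep, which format to standardize to) depend on the data.

```python
import re


def clean_text(raw):
    """Lower-case the text, replace special characters with spaces, collapse whitespace."""
    text = raw.strip().lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    text = re.sub(r"\s+", " ", text)          # standardize spacing
    return text.strip()


def extract_first_number(raw):
    """Return the first number found in the text as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


print(clean_text("  Very-Good!!  (Premium*) "))  # very good premium
print(extract_first_number("approx. 0.75 ct"))   # 0.75
```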
• Pandas: For handling missing values, duplicates, and general data manipulation.
• Numerical features