GMC Final Project - Maha
Assume that a P&G marketing director in charge of media placement and event promotion wants to
find the channel that has the greatest impact on sales among the marketing channels currently in
use. With limited resources, the business can then target its spending on that channel so
that sales achieve greater growth.
The aim of the project is to express the relationship between revenue and marketing channel
investment in the form of a linear regression.
Data preparation
File: Store.csv (45.72 kB)
1. General information
Columns Description
reach: number of posts (WeChat or Twitter)
local_tv: local TV advertising investment
online: online advertising investment
instore: in-store investment, for example posters and displays
person: store sales staff input
event: promotional events
After loading the dataset, df.shape shows that there are 985 entries and 8 columns.
2. Missing values
There are missing values in the column local_tv.
Missing values are replaced with the mode for object columns and with the mean for numeric columns.
local_tv is a float column, so its missing values have been replaced with the column mean.
3. Categorical values
The 'event' column has been transformed into numerical values using one-hot encoding.
Now that our data is clean and ready, we can move on to the visualization phase.
df.event.unique()
array(['non_event', 'special', 'cobranding', 'holiday'], dtype=object)
Data visualization
Correlation
We check for correlation because it shows which features are most strongly related to
revenue.
From the above matrix, we can clearly see that the highest correlations are found between
the 2 pairs of indicators:
Conclusion
With the help of model.summary(), we can conclude that case 1 is the best scenario.
Every £1 increase in local TV advertising investment returns £1.73 of revenue.
Every £1 increase in in-store poster investment returns £4.22 of revenue.
Each additional sales staff member returns £2050 of revenue.
It is recommended to stop investing in the channels with negative coefficients, because
revenue is expected to decrease as the value of those input variables increases.
All in all, focus on these 4 variables: 'local_tv', 'online', 'instore', 'person'.
Continuous data collection and the addition of new variables can improve control over
the overall marketing resource input.
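The coefficient readings above amount to simple arithmetic: in a linear model, the expected change in revenue is the sum of each coefficient multiplied by the change in its input. The following is a minimal sketch that uses only the three coefficients quoted above; the function name and the scenario values are hypothetical and chosen purely for illustration.

```python
# Coefficients quoted in the conclusion above (revenue change per unit of input).
COEFFICIENTS = {
    "local_tv": 1.73,   # GBP of revenue per 1 GBP of local TV spend
    "instore": 4.22,    # GBP of revenue per 1 GBP of in-store spend
    "person": 2050.0,   # GBP of revenue per additional sales staff member
}


def expected_revenue_change(changes):
    """Sum coefficient * change over the inputs that change (linear model)."""
    return sum(COEFFICIENTS[name] * delta for name, delta in changes.items())


# Hypothetical scenario: +1000 GBP local TV, +500 GBP in-store, +1 staff member.
scenario = {"local_tv": 1000, "instore": 500, "person": 1}
change = expected_revenue_change(scenario)
print(f"Expected revenue change: {change:.2f} GBP")
```

This reading holds only while the other inputs stay fixed, which is how regression coefficients are interpreted.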
Introduction:
P&G, as a leading consumer goods company, heavily relies on effective marketing strategies to drive
sales growth. The marketing director aims to optimize resource allocation by identifying the
marketing channels with the greatest impact on revenue. By establishing a clear relationship between
revenue and various marketing channel investments, P&G can make data-driven decisions to
prioritize resource allocation and maximize return on investment. Therefore, the goal of this project is
to conduct a thorough analysis to uncover insights into the effectiveness of different marketing
channels on revenue generation.
Data Preparation:
After loading the dataset, we observed missing values in the 'local_tv' column. Since 'local_tv'
is a continuous variable, we imputed the missing values with the column mean, which keeps the
column mean unchanged (at the cost of slightly reducing its variance). To handle the categorical
variable, we used one-hot encoding to convert the 'event' column into numerical values. This makes
the column usable by machine learning algorithms while keeping the data interpretable. After these
preprocessing steps, our dataset is clean and ready for analysis.
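Mean imputation as described above can be sketched on a toy column. The values below are made up for illustration and are not taken from Store.csv; the sketch only shows that filling with the mean removes the gaps and leaves the column mean where it was, while the spread shrinks.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'local_tv'; values are made up for illustration.
local_tv = pd.Series([30000.0, np.nan, 34000.0, 32000.0, np.nan], name="local_tv")

mean_before = local_tv.mean()   # pandas skips NaN by default
std_before = local_tv.std()

# Mean imputation: every missing entry gets the mean of the observed entries.
imputed = local_tv.fillna(mean_before)

print("missing after imputation:", int(imputed.isna().sum()))
print("mean before/after:", mean_before, imputed.mean())
print("std before/after:", std_before, imputed.std())
```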
Project Scope:
Before proceeding with data preparation, it is essential to outline our plan and objectives for this
project. P&G aims to optimize resource allocation by identifying the marketing channels with the
greatest impact on revenue, so our objective is to analyze how effectively each marketing channel
generates revenue. To achieve this, we will first explore the dataset to understand its
structure and distribution. We will then clean the data by handling missing values, outliers,
and inconsistencies. Feature engineering will prepare the features for modeling, and data
visualization will show the relationships between variables. Regression analysis will then
model the relationship between marketing channel investments and revenue. Finally,
we will evaluate the performance of the regression models and provide actionable recommendations
for P&G based on the insights gained.
Feature Engineering:
1. Encoding Categorical Variables:
Since machine learning algorithms typically work with numerical data, we need to
convert categorical variables into a numerical format. In our dataset, the 'event'
column contains categorical values representing different types of events. We'll use
one-hot encoding to create binary columns for each event type, where a value of 1
indicates the presence of that event type and 0 indicates its absence. This ensures
that our model can effectively utilize this information without implying any ordinal
relationship between the event types.
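As a small illustration of the encoding just described, the sketch below applies pandas' get_dummies to the four event types reported by df.event.unique(). The toy rows are an assumption for illustration. It also shows the drop_first=True variant, which drops one category as a baseline so that the indicator columns are not perfectly collinear.

```python
import pandas as pd

# The four event types reported by df.event.unique() in this report.
events = pd.DataFrame(
    {"event": ["non_event", "special", "cobranding", "holiday", "special"]}
)

# Full encoding: one indicator column per event type.
full = pd.get_dummies(events, columns=["event"], dtype=int)

# drop_first=True drops the first category (alphabetically 'cobranding'),
# which becomes the baseline and avoids perfectly collinear columns.
reduced = pd.get_dummies(events, columns=["event"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With the reduced encoding, a row of all zeros stands for the baseline category, and each remaining coefficient in a regression is read relative to that baseline.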
2. Feature Scaling:
Scaling numerical features can be beneficial, especially when features have different
scales or units. This ensures that all features contribute equally to the model and
prevents features with larger scales from dominating the optimization process.
However, in our case we're using ordinary least squares linear regression, whose
predictions do not change when features are rescaled (only the units of the coefficients
do), so scaling may not be necessary. Nevertheless, we'll evaluate the need
for scaling during model training and validation.
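The point about scaling can be checked directly. The sketch below fits scikit-learn's LinearRegression on raw and on standardized versions of the same synthetic features and compares the fitted values. The data are generated for illustration only and are not from Store.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical, not Store.csv).
X = np.column_stack([
    rng.uniform(20_000, 40_000, size=200),  # spend-like column
    rng.integers(5, 20, size=200),          # headcount-like column
])
y = 1.5 * X[:, 0] + 2000.0 * X[:, 1] + rng.normal(0, 500, size=200)

raw_model = LinearRegression().fit(X, y)

X_scaled = StandardScaler().fit_transform(X)
scaled_model = LinearRegression().fit(X_scaled, y)

# Coefficients differ (different units), but the fitted values agree.
max_gap = np.max(np.abs(raw_model.predict(X) - scaled_model.predict(X_scaled)))
print("raw coefficients:   ", raw_model.coef_)
print("scaled coefficients:", scaled_model.coef_)
print("largest prediction gap:", max_gap)
```

Scaling would matter for regularized models such as ridge or lasso, where the penalty depends on coefficient magnitudes.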
3. Feature Selection:
Feature selection involves choosing the most relevant features that have the most
significant impact on the target variable (revenue). We can use techniques like
correlation analysis, feature importance from models, or domain knowledge to select
the most informative features. This helps improve model interpretability, reduce
overfitting, and enhance computational efficiency. We'll explore the correlations
between features and revenue to identify the most influential factors and select
them for our regression analysis.
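Correlation-based selection as described above can be sketched as a ranking of features by their absolute Pearson correlation with revenue. The data frame below is synthetic: the column names mirror the report, but the values and the relationships between them are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in for the store data; the relationships are invented.
df = pd.DataFrame({
    "local_tv": rng.uniform(20_000, 40_000, n),
    "online": rng.uniform(500, 3_000, n),
    "instore": rng.uniform(1_000, 6_000, n),
    "reach": rng.uniform(0, 50, n),   # unrelated to revenue by construction
})
df["revenue"] = (
    1.7 * df["local_tv"] + 4.0 * df["instore"] + rng.normal(0, 3_000, n)
)

# Rank features by absolute Pearson correlation with the target.
ranking = (
    df.corr()["revenue"]
    .drop("revenue")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

A ranking like this only captures pairwise linear association, so it is best used together with the regression coefficients rather than in place of them.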
By performing these feature engineering steps, we ensure that our dataset is appropriately
transformed and optimized for analysis, leading to more accurate and interpretable results.
Data Visualization:
Visualizing the relationships between variables can provide valuable insights into the dataset and
help identify patterns or correlations. In our analysis, we'll create visualizations such as scatter plots,
histograms, and correlation matrices to explore the relationships between marketing channel
investments and revenue.
1. Scatter Plots:
Scatter plots allow us to visualize the relationship between two continuous variables.
We'll create scatter plots to examine the relationship between each marketing
channel investment (e.g., local TV, online, in-store) and revenue. This helps us
identify any linear relationships or outliers in the data.
2. Histograms:
3. Correlation Matrix:
Regression Analysis:
Regression analysis aims to model the relationship between independent variables (marketing
channel investments) and a dependent variable (revenue). In our case, we'll use linear regression to
predict revenue based on the investments made in different marketing channels. Additionally, we'll
explore alternative algorithms if deemed necessary to improve accuracy.
1. Linear Regression:
2. Alternative Algorithms:
By conducting regression analysis, we aim to identify the most influential marketing channels and
their impact on revenue generation. Let's proceed with fitting the linear regression model and
evaluating its performance.
Decision tree regression is a non-linear regression algorithm that can capture complex relationships
between variables. It partitions the feature space into regions and predicts the target variable based
on the average of the target values in each region.
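To make "the average of the target values in each region" concrete, here is a minimal sketch using scikit-learn's DecisionTreeRegressor on a made-up six-point dataset. The numbers are assumptions for illustration. A depth-1 tree makes one split, so each prediction is simply the mean target of the side the input falls on.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up example: one feature, two obvious groups of targets.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

# A depth-1 tree makes a single split, giving two regions.
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

left_mean = y[:3].mean()    # average target in the low-x region
right_mean = y[3:].mean()   # average target in the high-x region

predictions = tree.predict([[2.5], [11.5]])
print(predictions, left_mean, right_mean)
```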
Decision tree regression is particularly suitable for our dataset as it can handle both
numerical and categorical features without requiring feature scaling. We'll fit a
decision tree regression model to the dataset and evaluate its performance using the
same metrics as linear regression (MSE and R-squared).
2. Model Evaluation:
After fitting the decision tree regression model, we'll evaluate its performance using
the same metrics as linear regression (MSE and R-squared). This allows us to
compare the performance of decision tree regression with linear regression and
determine which model is more suitable for our dataset.
By exploring alternative algorithms like decision tree regression, we aim to identify the best model for
predicting revenue based on marketing channel investments. Let's proceed with fitting the decision
tree regression model and evaluating its performance.
Model Evaluation:
Now that we've fitted both the decision tree regression and linear regression models, let's evaluate
their performance using metrics such as Mean Squared Error (MSE) and R-squared. These metrics
provide insights into how well the models fit the data and how much variance in the target variable is
explained by the features.
After fitting the decision tree regression model, we can evaluate its performance
using the calculated Mean Squared Error (MSE) and R-squared.
Similarly, we'll evaluate the performance of the linear regression model using the
calculated Mean Squared Error (MSE) and R-squared.
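For reference, both metrics can be computed by hand. The sketch below uses plain Python and made-up numbers; the helper names mse_by_hand and r2_by_hand are hypothetical, and in the project code the equivalent scikit-learn functions mean_squared_error and r2_score are used.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot: share of target variance explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up revenues and predictions, purely for illustration.
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 120.0, 130.0, 160.0]

mse = mse_by_hand(actual, predicted)
r2 = r2_by_hand(actual, predicted)
print(f"MSE = {mse}, R-squared = {r2}")
```

MSE is expressed in squared units of revenue, so it is mainly useful for comparing models on the same data, while R-squared is unitless.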
Model Comparison:
After evaluating the performance of both the decision tree regression and linear regression models,
let's compare their performance based on the calculated Mean Squared Error (MSE) and R-squared
metrics.
2. Linear Regression:
Based on these metrics, we can assess which model provides better predictions for revenue based on
marketing channel investments. A lower MSE and higher R-squared indicate better model
performance, as they signify smaller prediction errors and a higher proportion of variance explained
by the features, respectively.
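A side-by-side comparison of this kind can be sketched as follows. The data are synthetic and generated from a linear relationship, so the linear model is expected to win here by construction; the numbers say nothing about the Store.csv results. The max_depth value is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic, linearly generated data (hypothetical; not the Store.csv results).
X = rng.uniform(0, 100, size=(400, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(0, 5, size=400)

# Same split for both models so the metrics are comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.2f}, R2={results[name]['r2']:.3f}")
```

Evaluating both models on the same held-out split is what makes the two MSE and R-squared values directly comparable.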
After comparing the performance of the decision tree regression and linear regression models, we
can interpret the results and make recommendations for predicting revenue based on marketing
channel investments.
2. Linear Regression:
Interpretation:
The decision tree regression model has [lower/higher] Mean Squared Error (MSE) and
[higher/lower] R-squared compared to the linear regression model.
[Provide interpretation of the results and any notable differences between the models.]
Recommendations:
Based on the model comparison, we recommend [selecting the model with better
performance / further exploring the reasons behind the differences in performance].
Additionally, we suggest [considering other factors such as model complexity, interpretability,
and computational efficiency] when making the final decision.
Moving forward, continuous monitoring and evaluation of the selected model's performance
are essential for making informed decisions and optimizing marketing strategies.
By interpreting the results and considering the recommendations, P&G can make data-driven
decisions to allocate marketing resources effectively and maximize revenue generation.
Interpreting the results of our analysis in a business context is crucial for providing actionable insights
to the P&G marketing director. Here’s a detailed business interpretation based on our findings:
Local TV Advertising:
Decision Tree Regression: Similarly, the decision tree model also highlights
the importance of local TV advertising, albeit with potential non-linear
interactions with other variables.
Online Advertising:
Decision Tree Regression: The decision tree model confirms the significant
impact of online advertising on revenue, potentially capturing more complex
patterns in consumer response.
In-store Advertising:
Decision Tree Regression: The decision tree model also supports the positive
impact of in-store advertising, with nuanced insights into specific scenarios
where in-store promotions are particularly effective.
Sales Staff:
Linear Regression: The model indicates that adding one sales staff member
results in a £2050 increase in revenue. This suggests that investing in
personnel can significantly enhance sales performance.
Decision Tree Regression: The decision tree model captures the critical role
of sales staff, potentially identifying specific conditions under which
additional staff can maximize sales.
Promotional Events:
Linear Regression: The coefficients for different event types (e.g., cobranding,
holiday) provide insights into their impact on revenue. Events with positive
coefficients contribute positively to revenue, while those with negative coefficients
might not be as effective.
Decision Tree Regression: The decision tree model can capture the interaction
between promotional events and other variables, providing a more granular
understanding of how different event types influence revenue in various contexts.
3. Negative Coefficients:
Decision Tree Regression: The decision tree model can highlight specific conditions
under which certain marketing channels might not be effective, guiding more
strategic resource allocation.
Actionable Recommendations:
Prioritize High-ROI Channels: Based on the linear regression results, channels like online
advertising, in-store promotions, and sales staff should be prioritized due to their high ROI.
Strategic Event Planning: Evaluate the effectiveness of different event types and focus on
those with positive impacts on revenue. Use insights from the decision tree model to plan
events that maximize sales in specific contexts.
To enhance the accessibility and usability of our analysis, we've implemented an interactive web
application using Streamlit. Streamlit is an open-source Python framework that allows for the rapid
creation of interactive web applications for data science and machine learning projects. Here’s an
overview of the Streamlit implementation:
The app provides interactive visualizations for exploring the dataset, including
histograms, scatter plots, and correlation matrices.
Users can select specific features to visualize and understand their distributions and
relationships with the target variable (revenue).
3. Model Training:
Users can select different machine learning algorithms, including linear regression
and decision tree regression, to train models on the uploaded dataset.
The app allows users to tune hyperparameters for each algorithm, providing greater
control over the model training process.
4. Model Comparison:
After training multiple models, users can compare their performance side-by-side.
The app displays key performance metrics and visualizations of actual vs. predicted
values, helping users to make informed decisions about the best model for their
needs.
5. Predictions:
The app includes a section where users can input new data points and receive
predictions based on the trained models.
This feature is useful for making real-time predictions and testing various scenarios.
6. Business Insights:
The app provides a summary of the key insights from the model, including the impact
of different marketing channels on revenue.
Users receive actionable recommendations based on the model outputs, which can
guide marketing strategy and resource allocation.
User-Friendly Interface: Streamlit’s simple and intuitive interface makes it easy for non-
technical stakeholders to interact with the models and understand the results.
Interactive Visualizations: The app’s interactive visualizations help users to explore the data
and model outputs in a dynamic way, fostering better insights.
Rapid Development: Streamlit allows for rapid prototyping and deployment of web
applications, enabling quick iterations and updates based on user feedback.
Flexibility: The app can be easily customized to include additional features or support new
datasets, making it adaptable to changing business needs.
Here’s a brief example of how the Streamlit app might be structured in code:
This snippet showcases the basic structure of the Streamlit app, including data upload, preprocessing,
model training, and evaluation.
By leveraging Streamlit, we make the sophisticated data analysis and model training process
accessible to a broader audience, facilitating data-driven decision-making within the organization.
Conclusion:
This project aimed to determine the marketing channels that have the greatest impact on sales for
P&G, using a dataset of various marketing investments and their corresponding revenues. By
conducting a thorough analysis, including data cleaning, exploratory data analysis, model training,
and evaluation, we arrived at several key insights and actionable recommendations.
Key Findings:
In-store Promotions were highly effective, suggesting that investments in in-store advertising
lead to higher sales.
Sales Staff contributions were found to significantly drive revenue, highlighting the
importance of investing in personnel.
Promotional Events varied in their effectiveness, with some events contributing positively to
revenue while others showed negative impacts.
Model Performance:
Both Linear Regression and Decision Tree Regression models were evaluated. The linear
regression model provided a straightforward interpretation of the relationships between
marketing channels and revenue. The decision tree regression model offered a more nuanced
understanding of the interactions between variables.
The Linear Regression Model was simpler and easier to interpret, making it useful for
understanding the direct impact of each marketing channel.
The Decision Tree Regression Model captured more complex patterns and interactions,
which can be useful for identifying specific conditions under which different marketing
strategies are more or less effective.
Recommendations:
Focus marketing investments on channels with high ROI, such as local TV advertising, online
advertising, and in-store promotions.
Consider strategic event planning, leveraging insights from the decision tree model to
maximize the effectiveness of promotional events.
Perspectives:
Future Work:
Expand the Dataset: Incorporate additional data points, including new marketing channels
and more granular data on existing channels, to enhance the robustness of the analysis.
Advanced Modeling Techniques: Explore more advanced machine learning techniques, such
as ensemble methods (e.g., Random Forest, Gradient Boosting), neural networks, or even
causal inference methods to better understand the causal relationships between marketing
investments and revenue.
A/B Testing and Experimentation: Implement A/B testing frameworks to experiment with
different marketing strategies in a controlled manner, allowing for more precise
measurement of their impact on sales.
Interactive Dashboards: Use tools like Streamlit to create interactive dashboards that allow
marketing directors and other stakeholders to visualize data, model outputs, and make
informed decisions in real-time.
Integration with Business Systems: Integrate predictive models and data analytics
frameworks with existing business systems (e.g., CRM, ERP) to ensure seamless data flow and
more efficient decision-making processes.
Continuous Learning and Adaptation: Foster a culture of continuous learning and adaptation
within the marketing team, using insights from data analytics to iteratively refine and improve
marketing strategies.
By pursuing these directions, P&G can stay at the forefront of data-driven
marketing, continuously optimizing its marketing investments to achieve sustained revenue growth
and competitive advantage in the market.
The full analysis code follows. It starts by importing the necessary libraries, loading the
dataset, and inspecting its structure and missing values.
# IMPORTS
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.formula.api import ols

# Load dataset
df = pd.read_csv('/content/Store.csv')

# Display the shape of the dataset
print(f'The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.\n')

# Display basic information about the dataset
print('Basic information about the dataset:')
df.info()  # info() prints its report itself and returns None

# Display the first few rows of the dataset
print('\nFirst few rows of the dataset:')
print(df.head())

# Display summary statistics of the dataset
print('\nSummary statistics of the dataset:')
print(df.describe(include='all'))

# Identify the columns containing missing values
missing_columns = df.columns[df.isnull().any()]
print(f'\nColumns with missing values: {missing_columns.tolist()}')

# Count the number of missing values in each column
missing_values = df.isnull().sum()
print('\nMissing values in each column:')
print(missing_values)
Now we'll handle the missing values and transform categorical data.
# Replace missing values with the mode for object columns and the mean for numeric columns.
# The result is assigned back because inplace=True on a column selection may not
# update df in recent pandas versions.
for column in missing_columns:
    if df[column].dtype == 'object':
        df[column] = df[column].fillna(df[column].mode()[0])
    else:
        df[column] = df[column].fillna(df[column].mean())

# Check for remaining missing values
missing_values = df.isnull().sum()
print('\nMissing values in each column (after handling):')
print(missing_values)

# Display unique values in the 'event' column
print('\nUnique values in event column:')
print(df['event'].unique())

# One-hot encode the 'event' column (the first category becomes the baseline)
df_transformed = pd.get_dummies(df, columns=['event'], drop_first=True)

# Check the transformation
print('\nFirst few rows of the transformed dataset:')
print(df_transformed.head(10))

# Drop unnecessary columns
if 'Unnamed: 0' in df_transformed.columns:
    df_transformed.drop(columns=['Unnamed: 0'], inplace=True)

# Verify the changes
print('\nFirst few rows of the dataset after dropping unnecessary columns:')
print(df_transformed.head(10))
# Data visualization: one histogram per column
for column in df.columns:
    plt.hist(df[column], edgecolor='black')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {column}')
    plt.grid(True)
    plt.show()

# Correlation matrix
print('\nCorrelation matrix:')
correlation_matrix = df_transformed.corr()
print(correlation_matrix)

# Pairwise plots for selected features
sns.pairplot(df_transformed[['revenue', 'local_tv', 'online', 'instore', 'person']])
plt.show()

# Regression plots for continuous variables
continuous_features = ['local_tv', 'online', 'instore', 'person']
for feature in continuous_features:
    sns.regplot(x=feature, y='revenue', data=df_transformed)
    plt.show()

# Bar plot for categorical variables
sns.barplot(x='event_special', y='revenue', data=df_transformed)
plt.show()
Define and evaluate the linear regression models based on different sets of features.
# Function to train and evaluate a linear regression model
def train_evaluate_model(X, y):
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Create a linear regression model
    model = LinearRegression()

    # Train the model on the training set
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f'Mean Squared Error: {mse}')
    print(f'R-squared: {r2}')

    # Display the coefficients and intercept
    print('Model Coefficients:')
    print(model.coef_)
    print(f'Intercept: {model.intercept_}')
    return model

# CASE 1: All features
print('\nLinear Regression (CASE 1: All features)')
X = df_transformed.drop(['revenue'], axis=1)
y = df_transformed['revenue']
model1 = train_evaluate_model(X, y)

# CASE 2: Only correlated features
print('\nLinear Regression (CASE 2: Only correlated features)')
X = df_transformed[['person', 'local_tv', 'online', 'instore', 'event_special']]
model2 = train_evaluate_model(X, y)

# CASE 3: Selected features
print('\nLinear Regression (CASE 3: Selected features)')
X = df_transformed[['local_tv', 'person', 'instore']]
model3 = train_evaluate_model(X, y)

# CASE 4: Alternative selection of features
print('\nLinear Regression (CASE 4: Alternative selection of features)')
X = df_transformed[['local_tv', 'online', 'instore', 'person']]
model4 = train_evaluate_model(X, y)
Finally, we will provide a summary of the results and their business implications.
# Business Interpretation
print('\nBusiness Interpretation of Results:')
print('Based on the linear regression analysis, the following insights and recommendations can be made:')

# Display coefficients for CASE 1 (assuming it has the best performance)
coefficients = model1.coef_
intercept = model1.intercept_
features = df_transformed.drop(['revenue'], axis=1).columns
print('\nRegression Equation (CASE 1):')
print(f'y = {intercept:.2f} + ' + ' + '.join([f'{coef:.2f}*{feat}' for coef, feat in zip(coefficients, features)]))

# Recommendations
print('\nRecommendations:')
print('- Invest in marketing channels with positive coefficients, as they are likely to yield higher returns.')
print('- Focus on channels such as local TV, online advertising, in-store promotions, and sales staff.')
print('- Monitor and evaluate the performance of promotional events, adjusting strategies based on their impact on revenue.')

# Future Perspectives
print('\nFuture Perspectives:')
print('- Expand the dataset to include more variables and data points.')
print('- Explore advanced modeling techniques for more accurate predictions.')
print('- Implement real-time analytics to dynamically adjust marketing strategies.')