Advertising in ML
September 1, 2024
1) Import libraries
Let’s begin by importing the following Python libraries: NumPy, Pandas, Seaborn, and Matplotlib Pyplot, and enabling the Matplotlib inline magic.
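A minimal import cell covering those libraries (reconstructed here, since the original cell isn’t preserved):

[ ]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline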
2) Import dataset
The second step is to import the dataset. Import the CSV file into Jupyter Notebook using pd.read_csv() and pass the file path as its argument.
[4]: df = pd.read_csv(r'D:\Jupyter\Advertising\advertising.csv')
This loads the dataset into a Pandas dataframe. You can review the dataframe using the head() command and clicking “Run”, or by navigating to Cell > Run All from the top menu. Here are the first 10 rows:
[6]: df.head(10)
[6]:    Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  \
0                          68.95   35     61833.90                256.09
1                          80.23   31     68441.85                193.77
2                          69.47   26     59785.94                236.50
3                          74.15   29     54806.18                245.89
4                          68.37   35     73889.99                225.58
5                          59.99   23     59761.56                226.74
6                          88.91   33     53852.85                208.36
7                          66.00   48     24593.33                131.76
8                          74.53   30     68862.00                221.51
9                          69.88   20     55642.32                183.82

                           Ad Topic Line              City  Male     Country  \
0     Cloned 5thgeneration orchestration       Wrightburgh     0     Tunisia
1     Monitored national standardization         West Jodi     1       Nauru
2       Organic bottom-line service-desk          Davidton     0  San Marino
3  Triple-buffered reciprocal time-frame    West Terrifurt     1       Italy
4          Robust logistical utilization      South Manuel     0     Iceland
5        Sharable client-driven software         Jamieberg     1      Norway
6             Enhanced dedicated support       Brandonstad     0     Myanmar
7               Reactive local challenge  Port Jefferybury     1   Australia
8         Configurable coherent function        West Colin     1     Grenada
9     Mandatory homogeneous architecture        Ramirezton     1       Ghana

             Timestamp  Clicked on Ad
0  2016-03-27 00:53:11              0
1  2016-04-04 01:39:02              0
2  2016-03-13 20:35:42              0
3  2016-01-10 02:31:19              0
4  2016-06-03 03:36:18              0
5  2016-05-19 14:30:17              0
6  2016-01-28 20:59:32              0
7  2016-03-07 01:40:15              1
8  2016-04-18 09:33:42              0
9  2016-07-11 01:42:51              0
3) Remove features
Next, we need to remove the features that can’t be parsed by this algorithm: the non-numerical variables Ad Topic Line, City, Country, and Timestamp. The binary Male variable is also dropped here, leaving the five numeric columns shown below.
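The removal cell itself is not shown above; one minimal way to drop these columns with pandas (the column names match the head(10) output) is:

[ ]: #Drop the text-based and binary columns so only numeric features remain
df = df.drop(columns=['Ad Topic Line', 'City', 'Male', 'Country', 'Timestamp'])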
[12]: df.head()
[12]:    Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  \
0                          68.95   35     61833.90                256.09
1                          80.23   31     68441.85                193.77
2                          69.47   26     59785.94                236.50
3                          74.15   29     54806.18                245.89
4                          68.37   35     73889.99                225.58

   Clicked on Ad
0              0
1              0
2              0
3              0
4              0
[14]: df.shape
[14]: (1000, 5)
[16]: df.columns
[16]: Index(['Daily Time Spent on Site', 'Age', 'Area Income',
       'Daily Internet Usage', 'Clicked on Ad'],
      dtype='object')
[18]: df.describe()
[28]: df.describe(include='all')
[28]:       Daily Time Spent on Site          Age   Area Income  \
freq                            NaN          NaN           NaN
mean                      65.000200    36.009000  55000.000080
std                       15.853615     8.785562  13414.634022
min                       32.600000    19.000000  13996.500000
25%                       51.360000    29.000000  47031.802500
50%                       68.215000    35.000000  57012.300000
75%                       78.547500    42.000000  65470.635000
max                       91.430000    61.000000  79484.800000
4) Scale data
Next, we will import the Scikit-learn function StandardScaler, which is used to standardize features by removing the mean (so every variable is centered at zero) and scaling to unit variance. The mean and standard deviation are computed when the scaler is fitted, stored, and then used later with the transform method (which recreates the dataframe’s values in their transformed form).
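The import itself is a single line (StandardScaler lives in Scikit-learn’s sklearn.preprocessing module):

[ ]: from sklearn.preprocessing import StandardScaler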
After importing StandardScaler, we can assign it to a new variable, fit the function to the features contained in the dataframe, and transform those values under a new variable name.
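The original cell is not preserved here; a minimal sketch, using the scaled_data name that the later cells rely on (scaler is an assumed variable name):

[ ]: #Fit the scaler to the dataframe and store the standardized values
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)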
Take careful note of the next line of code, as this is where we reshape the dataframe’s features into a defined number of components. For this exercise, we want to find the components that have the most impact on data variability. By setting the number of components to 2 (n_components=2), we’re asking PCA to find the two components that best explain the variability in the data. The number of components can be modified according to your requirements, but two components are the simplest to interpret and visualize on a scatterplot.
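That line, together with its import (PCA lives in Scikit-learn’s sklearn.decomposition module; pca is the variable name the following cells use):

[ ]: from sklearn.decomposition import PCA
#Ask PCA for the two components that explain the most variance
pca = PCA(n_components=2)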
Next, we need to fit the two components to our scaled data and recreate the dataframe’s values using the transform method.
[30]: pca.fit(scaled_data)
scaled_pca = pca.transform(scaled_data)
Let’s check the transformation using the shape command to compare the two datasets.
[32]: #Query the number of rows and columns in the scaled dataframe
scaled_data.shape
[32]: (1000, 5)
[34]: #Query the number of rows and columns in the scaled PCA dataframe
scaled_pca.shape
[34]: (1000, 2)
We can see that the scaled data has been compressed from 1,000 rows with 5 columns to 1,000 rows with 2 columns using PCA.
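As an added check (not part of the original walkthrough), you can inspect how much of the total variability the two retained components capture via the fitted model’s explained_variance_ratio_ attribute:

[ ]: #Fraction of total variance explained by each of the two components
pca.explained_variance_ratio_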
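Version 1: Visualized plot
The original plotting cell is not preserved here, but a minimal sketch of this first version (coloring each point directly by the Clicked on Ad column, without a legend) would look like this:

[ ]: plt.figure(figsize=(10,8))
#Color each point by its Clicked on Ad outcome (no legend yet)
plt.scatter(scaled_pca[:, 0], scaled_pca[:, 1], c=df['Clicked on Ad'])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()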
The two components are color-coded to delineate the outcome of Clicked on Ad (Clicked/Did not click). Keep in mind that components don’t correspond to a single variable but rather a combination of variables.
Finally, we can modify the code to add a color legend. This is a more advanced set of code and requires the use of a for-loop in Python and RGB color codes that can be found at Rapidtables.com.
Version 2: Visualized plot with color legend
[49]: plt.figure(figsize=(10,8))
legend = df['Clicked on Ad']
#Add indigo and yellow RGB colors
colors = {0: '#4B0082', 1: '#FFFF00'}
labels = {0: 'Did not click', 1: 'Clicked'}
#Use a for-loop to plot each outcome class in its own color
for t in np.unique(legend):
    ix = np.where(legend == t)
    plt.scatter(scaled_pca[ix, 0], scaled_pca[ix, 1], c=colors[t], label=labels[t])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.legend()
plt.show()
From this visualization, we can see the clear separation of outcomes with the aid of a color legend in the top-right corner. The output of PCA is now ready for further analysis using a supervised learning technique such as logistic regression or k-nearest neighbors.
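As a pointer for that next step, here is a minimal sketch (an addition here, not part of the original exercise) that feeds the two components into a logistic regression model; X_train, X_test, y_train, y_test, and model are illustrative names. Note that Clicked on Ad was included in the data passed to PCA above, so a rigorous model would recompute the components from the four feature columns only.

[ ]: from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#Hold out 30% of the rows for testing (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(
    scaled_pca, df['Clicked on Ad'], test_size=0.3, random_state=10)
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)  #Mean accuracy on the held-out rows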