
advertising

September 1, 2024

1) Import libraries
Let's begin by importing the following Python libraries: NumPy, Pandas, Seaborn, and Matplotlib Pyplot, and enabling inline plotting with %matplotlib inline.

[2]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

[ ]: 2) Import dataset
The second step is to import the dataset into the notebook. Import the CSV file into Jupyter Notebook using pd.read_csv(), adjusting the file path according to your operating system.

[4]: df = pd.read_csv(r'D:\Jupyter\Advertising\advertising.csv')
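If the CSV file sits in the same folder as the notebook, a relative path also works (a minimal alternative, assuming that file location):

#Relative path, assuming advertising.csv is in the notebook's working directory
df = pd.read_csv('advertising.csv')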

[ ]: This loads the dataset into a Pandas dataframe. You can review the dataframe using the head() command and clicking “Run”, or by navigating to Cell > Run All from the top menu. Here are the first 10 rows.

[6]: df.head(10)

[6]: Daily Time Spent on Site Age Area Income Daily Internet Usage \
0 68.95 35 61833.90 256.09
1 80.23 31 68441.85 193.77
2 69.47 26 59785.94 236.50
3 74.15 29 54806.18 245.89
4 68.37 35 73889.99 225.58
5 59.99 23 59761.56 226.74
6 88.91 33 53852.85 208.36
7 66.00 48 24593.33 131.76
8 74.53 30 68862.00 221.51
9 69.88 20 55642.32 183.82

Ad Topic Line City Male Country \
0 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
1 Monitored national standardization West Jodi 1 Nauru
2 Organic bottom-line service-desk Davidton 0 San Marino
3 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
4 Robust logistical utilization South Manuel 0 Iceland
5 Sharable client-driven software Jamieberg 1 Norway
6 Enhanced dedicated support Brandonstad 0 Myanmar
7 Reactive local challenge Port Jefferybury 1 Australia
8 Configurable coherent function West Colin 1 Grenada
9 Mandatory homogeneous architecture Ramirezton 1 Ghana

Timestamp Clicked on Ad
0 2016-03-27 00:53:11 0
1 2016-04-04 01:39:02 0
2 2016-03-13 20:35:42 0
3 2016-01-10 02:31:19 0
4 2016-06-03 03:36:18 0
5 2016-05-19 14:30:17 0
6 2016-01-28 20:59:32 0
7 2016-03-07 01:40:15 1
8 2016-04-18 09:33:42 0
9 2016-07-11 01:42:51 0

[ ]: We can see that the dataset comprises 10 features: Daily Time Spent on Site, Age, Area Income, Daily Internet Usage, Ad Topic Line, City, Male, Country, Timestamp, and Clicked on Ad.

3) Remove features
Next, we need to remove non-numerical features that can't be parsed by this algorithm: Ad Topic Line, City, Country, and Timestamp. Although the Timestamp values are expressed in numerals, their special formatting is not compatible with the mathematical calculations this algorithm performs between variables.
We also need to remove the discrete variable Male, which is expressed as an integer (0 or 1), as our model only examines continuous input features.
Let's remove the five features from the dataset using the del statement (see the drop alternative below), specifying the column titles we wish to remove.

[10]: del df['Ad Topic Line']
del df['City']
del df['Country']
del df['Timestamp']
del df['Male']
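An equivalent one-liner, using the Pandas drop method rather than the original del statements, removes the same five columns in a single call:

#Drop all five non-continuous columns at once
df = df.drop(columns=['Ad Topic Line', 'City', 'Country', 'Timestamp', 'Male'])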

[12]: df.head()

[12]: Daily Time Spent on Site Age Area Income Daily Internet Usage \
0 68.95 35 61833.90 256.09
1 80.23 31 68441.85 193.77
2 69.47 26 59785.94 236.50

3 74.15 29 54806.18 245.89
4 68.37 35 73889.99 225.58

Clicked on Ad
0 0
1 0
2 0
3 0
4 0

[14]: df.shape

[14]: (1000, 5)

[16]: df.columns

[16]: Index(['Daily Time Spent on Site', 'Age', 'Area Income',


'Daily Internet Usage', 'Clicked on Ad'],
dtype='object')

[18]: df.describe()

[18]: Daily Time Spent on Site Age Area Income \
count 1000.000000 1000.000000 1000.000000
mean 65.000200 36.009000 55000.000080
std 15.853615 8.785562 13414.634022
min 32.600000 19.000000 13996.500000
25% 51.360000 29.000000 47031.802500
50% 68.215000 35.000000 57012.300000
75% 78.547500 42.000000 65470.635000
max 91.430000 61.000000 79484.800000

Daily Internet Usage Clicked on Ad
count 1000.000000 1000.00000
mean 180.000100 0.50000
std 43.902339 0.50025
min 104.780000 0.00000
25% 138.830000 0.00000
50% 183.130000 0.50000
75% 218.792500 1.00000
max 269.960000 1.00000

[28]: df.describe(include='all')

[28]: Daily Time Spent on Site Age Area Income \
count 1000.000000 1000.000000 1000.000000
unique NaN NaN NaN
top NaN NaN NaN
freq NaN NaN NaN
mean 65.000200 36.009000 55000.000080
std 15.853615 8.785562 13414.634022
min 32.600000 19.000000 13996.500000
25% 51.360000 29.000000 47031.802500
50% 68.215000 35.000000 57012.300000
75% 78.547500 42.000000 65470.635000
max 91.430000 61.000000 79484.800000

Daily Internet Usage Ad Topic Line City \
count 1000.000000 1000 1000
unique NaN 1000 969
top NaN Cloned 5thgeneration orchestration Lisamouth
freq NaN 1 3
mean 180.000100 NaN NaN
std 43.902339 NaN NaN
min 104.780000 NaN NaN
25% 138.830000 NaN NaN
50% 183.130000 NaN NaN
75% 218.792500 NaN NaN
max 269.960000 NaN NaN

Male Country Timestamp Clicked on Ad
count 1000.000000 1000 1000 1000.00000
unique NaN 237 1000 NaN
top NaN France 2016-03-27 00:53:11 NaN
freq NaN 9 1 NaN
mean 0.481000 NaN NaN 0.50000
std 0.499889 NaN NaN 0.50025
min 0.000000 NaN NaN 0.00000
25% 0.000000 NaN NaN 0.00000
50% 0.000000 NaN NaN 0.50000
75% 1.000000 NaN NaN 1.00000
max 1.000000 NaN NaN 1.00000

[20]: sns.pairplot(df, vars=['Area Income','Daily Time Spent on Site','Daily Internet Usage'])

[20]: <seaborn.axisgrid.PairGrid at 0x266bcbf2bd0>
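As an optional tweak not in the original walkthrough, seaborn's hue parameter can color each point in the pairplot by outcome, which previews the separation we visualize later:

#Color-code the pairplot by the Clicked on Ad outcome
sns.pairplot(df, vars=['Area Income','Daily Time Spent on Site','Daily Internet Usage'], hue='Clicked on Ad')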

[ ]: 4) Scale data
Next we will import the Scikit-learn class StandardScaler, which is used to standardize features by centering each variable at a mean of zero and scaling it to unit variance. The mean and standard deviation are stored during fitting and used later by the transform method, which recreates the data with the requested transformed values.

[22]: #Import StandardScaler
from sklearn.preprocessing import StandardScaler

[ ]: After importing StandardScaler, we can assign it to a new variable, fit it to the features contained in the dataframe, and transform those values under a new variable name.

[24]: scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
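As a quick optional check (mean_ and scale_ are standard attributes of a fitted StandardScaler, though this step isn't in the original code), we can confirm what the scaler stored and that the output is standardized:

#Per-column means and standard deviations learned by fit()
print(scaler.mean_)
print(scaler.scale_)
#Each transformed column should now have mean ~0 and std ~1
print(scaled_data.mean(axis=0).round(3))
print(scaled_data.std(axis=0).round(3))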

[ ]: StandardScaler is often used in conjunction with PCA and other algorithms, including k-nearest neighbors and support vector machines, to rescale and standardize data features. This gives the dataset the properties of a standard normal distribution, with a mean of zero and a standard deviation of one.
Without standardization, the PCA algorithm is likely to lock onto features that maximize variance but whose variance is exaggerated by another factor, such as units of measurement. For example, the variance of Age changes dramatically when measured in days instead of years, and if left unchecked, this type of formatting might mislead the selection of components, which is based on maximizing variance. StandardScaler helps to avoid this problem by rescaling and standardizing variables.
Conversely, standardization might not be necessary for PCA if the scale of the variables is relevant to your analysis or consistent across variables.
5) Assign algorithm
Having laid much of the groundwork for our model, we can now import the PCA
algorithm from Scikit-learn’s decomposition library.

[26]: from sklearn.decomposition import PCA

[ ]: Take careful note of the next line of code as this is where we reshape the
dataframe’s features into a defined number of components. For this exercise, we
want to find the components that have the most impact on data variability. By
setting the number of components to 2 (n_components=2), we’re asking PCA to find
the two components that best explain variability in the data. The number of
components can be modified according to your requirements, but two components
is the simplest to interpret and visualize on a scatterplot.

[28]: pca = PCA(n_components=2)

[ ]: Next, we need to fit the two components to our scaled data and recreate the
dataframe’s values using the transform method.

[30]: pca.fit(scaled_data)
scaled_pca = pca.transform(scaled_data)
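As an optional sanity check (explained_variance_ratio_ is a standard attribute of a fitted PCA object, though this step isn't in the original code), we can see how much of the total variance the two components capture:

#Proportion of variance explained by each principal component
print(pca.explained_variance_ratio_)
#Total variance retained by the two components combined
print(pca.explained_variance_ratio_.sum())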

[ ]: Let’s check the transformation using the shape command to compare the two
datasets.

[32]: #Query the number of rows and columns in the scaled dataframe
scaled_data.shape

[32]: (1000, 5)

[ ]: Now query the shape of the scaled PCA dataframe.

[34]: #Query the number of rows and columns in the scaled PCA dataframe
scaled_pca.shape

[34]: (1000, 2)

[ ]: We can see that the scaled dataframe has been compressed from 1,000 rows with
5 columns to 1,000 rows with 2 columns using PCA.
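If it helps to inspect the compressed output alongside the original row index, it can be wrapped back into a dataframe (the column names PC1 and PC2 are our own choice, not part of the original code):

#Label the two components for easier inspection
pca_df = pd.DataFrame(scaled_pca, columns=['PC1', 'PC2'])
pca_df.head()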

[36]: #State the size of the plot
plt.figure(figsize=(10,8))

[36]: <Figure size 1000x800 with 0 Axes>

[ ]: 6) Visualize the output
Let's use the Python plotting library Matplotlib to visualize the two principal components on a 2-D scatterplot, with principal component 1 marked on the x-axis and principal component 2 on the y-axis.
We'll visualize the two principal components without a color legend in the first version of the code before adding code for the color legend in the second version.

[45]: #State the size of the plot
plt.figure(figsize=(10,8))

#Configure the scatterplot's x and y axes as principal components 1 and 2, color-coded by the variable Clicked on Ad
plt.scatter(scaled_pca[:, 0], scaled_pca[:, 1], c=df['Clicked on Ad'])

#State the scatterplot labels
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

[45]: Text(0, 0.5, 'Second Principal Component')

[ ]: The two components are color-coded to delineate the outcome of Clicked on Ad
(Clicked/Did not click). Keep in mind that components don’t correspond to a
single variable but rather a combination of variables.
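To see exactly which variables each component combines, we can inspect the component loadings (an optional aside; components_ is a standard attribute of a fitted PCA object, and the dataframe wrapper and PC1/PC2 names are our own):

#Each row is a principal component; each column is an original feature
loadings = pd.DataFrame(pca.components_, columns=df.columns, index=['PC1', 'PC2'])
print(loadings)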
Finally, we can modify the code to add a color legend. This is a more advanced set of code that requires the use of a for-loop in Python and RGB color codes, which can be found at Rapidtables.com.
Version 2: Visualized plot with color legend

[49]: plt.figure(figsize=(10,8))
legend = df['Clicked on Ad']
#Add indigo and yellow RGB colors
colors = {0: '#4B0082', 1: '#FFFF00'}
#Map each outcome to its label (0 = did not click, 1 = clicked)
labels = {0: 'Did not click', 1: 'Clicked'}
#Use a for-loop to set the color and label for each outcome group
for t in np.unique(legend):
    ix = np.where(legend == t)
    plt.scatter(scaled_pca[ix, 0], scaled_pca[ix, 1], c=colors[t], label=labels[t])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.legend()
plt.show()

[ ]: From this visualization, we can see the clear separation of outcomes with the aid of a color legend in the top right corner. The output of PCA is now ready for further analysis using a supervised learning technique such as logistic regression or k-nearest neighbors.
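As a minimal sketch of that next step (our own illustration, not part of the original walkthrough; the test_size and random_state values are arbitrary, and in practice you would typically exclude Clicked on Ad before scaling and PCA so the target doesn't leak into the components):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#Split the two principal components and the target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    scaled_pca, df['Clicked on Ad'], test_size=0.3, random_state=10)

#Fit a logistic regression classifier on the components
model = LogisticRegression()
model.fit(X_train, y_train)

#Evaluate accuracy on the held-out test set
print(accuracy_score(y_test, model.predict(X_test)))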

