
Data understanding and preparation
Retailer X has provided you with four files:
 Customer data in the csv format (Customer Data Set - Student 1 of 3.csv).
 Product data in the csv format (Product Data Set - Student 2 of 3.csv).
 Transaction data in the csv format (Transaction Data Set - Student 3 of 3.csv).
 The data dictionary Excel file (Retailer X Data Dictionary.xlsx), which gives a
short description for all the columns in the data files.
It is time to look through the data dictionary. Open the dictionary file in Excel and
think about the following questions:
 Based on your business understanding (documented previously), do you have
the data that you need to answer the client’s questions?
 What relationships in the data do you want to explore?
 What extra fields do you think you might need and might be able to derive?
If necessary, take time to revisit the business understanding section in the first course
and make any necessary adjustments based on the data that was provided.
Here are the main Python packages that you are going to use in the code:
 NumPy (Numerical Python) is a linear algebra library in Python. It is an important
library on which almost every data science or machine learning Python package, such
as SciPy (Scientific Python), Matplotlib (plotting library), and Scikit-learn, depends.
 Pandas is a Python package that provides fast, flexible, and expressive data
structures that are designed to make working with “relational” or “labeled” data both
easy and intuitive. It is based on NumPy.
Pandas is one of the most useful data analysis libraries in Python. You mainly use
Pandas to perform the needed analysis in the code.
 Matplotlib is a Python plotting library. It is used to generate plots, histograms,
bar charts, scatterplots, and other visualizations with a few lines of code.
 Scikit-learn is a popular Python machine learning library. It is built on NumPy,
SciPy, and Matplotlib. With Scikit-learn, you can build classification, regression, and
clustering models and perform other data preparation tasks.
You are going to load data into the code by using Pandas.
Before loading the data, you must understand the two key data structures in Pandas:
 A series is a one-dimensional labeled or indexed array. You can access
individual elements of a series through indexes.
 A data frame is like an Excel worksheet. You have column names that refer to
columns, and you have rows that can be accessed by using row numbers. The essential
difference is that column names and row numbers are known as column and row
indexes in data frames.
Series and data frames form the core data model for Pandas in Python. The data sets
are first read into these data frames and then various operations (such as group by or
aggregation) can be applied to their columns.
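As a minimal sketch of these two structures (the values here are illustrative, not from the Retailer X files):

import pandas as pd

# A series: a one-dimensional labeled array.
ages = pd.Series([34, 51, 27], index=['c1', 'c2', 'c3'])
ages['c2']                # access an element through its index -> 51

# A data frame: a table with column and row indexes.
df = pd.DataFrame({'AGE': [34, 51, 27], 'INCOME': [40000, 65000, 32000]})
df['AGE']                 # a single column is itself a series
df.loc[0, 'INCOME']       # access by row and column index -> 40000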
Now, it is time to write some code. You will be using the Jupyter notebook that you
uploaded into IBM Watson Studio in the previous section. This notebook has all the
code that you are going to explore in this course. Go through the following subsections
and perform the steps in them, that is, run the code in the notebook for each code step
and replicate the results that are shown.
The code part in this section is divided into three main subsections:
 Importing data files into your notebook: This subsection describes how to import
the data files into your code.
 Quick data exploration: This subsection describes column statistics and column
types, and how to convert from one data type to another.
 Analysis of the distribution of variables by using graphs: This subsection
describes how to draw bar charts and histograms and join different data frames to
produce useful insights.
Importing data files into your code
1. If your notebook is not open, select Assets from the home page, scroll down
to Notebooks, and open your "Data Understanding and Preparation" notebook.
Change it to edit mode by clicking the pen icon.

2. On the right side of the window, click the Find and add data icon (the one showing
the 1 and 0 logo). You will find all the files that you uploaded listed on the right.
3. Import data files into your code as follows:
o You will find empty code cells made for you, as shown in the screenshot below.
Click inside the code cell with the comment "#Import Product DataSet here".
o Click the Product Dataset file in the right pane, click Insert to Code, and
then click Insert pandas DataFrames.
The code cell is populated automatically with the import code.
Note: As shown by the green box in the following figure, the Pandas library was
automatically imported.
4. Click Run next to your cell and observe the result.
As you can see, the file uses a pipe (|) delimiter rather than the default comma (,).
5. Specify the right delimiter to use in the "sep" argument. You can also give the
data frame a more descriptive name than the default name that was given by the
automatically generated Watson code.
6. Head back to the "#Import Product DataSet" cell, replace the generated read and
head lines with the two lines of code that are shown below, and click Run.
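The replacement is essentially the following sketch, assuming the generated code exposes a file-like object (called body here, as in typical Watson-generated cells; the name in your cell might differ):

product_data = pd.read_csv(body, sep='|')  # use the pipe delimiter instead of the default comma
product_data.head()                        # display the first five rows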
7. Run the code again. The head() function that you just added displays the data in a
readable format.
8. Repeat steps 3 - 6 to import the data for transactions by using the code cell with the
comment "#Import Transaction DataSet Here". Make sure to replace the default data
frame df_data_1 and df_data_1.head() with "transactions_data" and use the "|"
separator, as outlined in the following code. Again, these two lines of code replace the
original read and head commands.
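Again as a sketch (the generated file object name might differ):

transactions_data = pd.read_csv(body, sep='|')  # the transaction file also uses the pipe delimiter
transactions_data.head()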


9. Run your code. The output should be what is shown in the following table.

10. Repeat steps 3 - 6 one more time to import the data for customers by using the
code cell with the comment "#Import Customer DataSet Here". Make sure to replace the
default data frame df_data_1 and df_data_1.head() with "customer_data", as shown
in the following shaded box. You do not specify any delimiter during the import
because the customer data file uses the normal comma separator. Again, these
two lines of code replace the original read and head commands.
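A sketch of the two replacement lines (again assuming a file-like object named body):

customer_data = pd.read_csv(body)  # comma is the default separator, so no sep argument is needed
customer_data.head()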

11. Run your code. The output is shown in the following table.

12. Refer to the “Retailer X Data Dictionary.xlsx” file for a description of each field
and its relevance to the business.
Quick Data Exploration
In this section, you explore some columns to determine the type of variables that are
associated with these columns, and learn about summary statistics like mean, median,
and standard deviation to describe their central tendency and dispersion. Complete the
following steps:
1. To discover the number of records in each file, use the shape attribute, which
returns a tuple that represents the dimensionality of the data frame. The returned result
has the following form:
(number of rows, number of columns)
2. Under the heading "Quick Data Exploration", run the first three code cells, which
are shown in the following figure. From this output, you conclude that Retailer X sells
30 products and served 500 customers in a total of 10,000 recorded transactions.
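Those three cells are essentially the following (the column counts depend on the files):

product_data.shape       # (30, ...)    -> 30 products
customer_data.shape      # (500, ...)   -> 500 customers
transactions_data.shape  # (10000, ...) -> 10,000 transactions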
3. Verify that the files were read into a Pandas data frame by running the following
line. (You might have a different input number. In this figure, our input is shown as In [22].)
Running the type command on a column like "AGE" indicates that a column on its own
is a Pandas Series type.
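In code form:

type(customer_data)         # pandas.core.frame.DataFrame
type(customer_data['AGE'])  # pandas.core.series.Series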
4. Look at the Pandas column data types by using the dtypes attribute. Run the
customer_data.dtypes cell.
You might be confused about these data types versus the variable types that were
explained earlier in this course. These data types (int64 or object) are the types that
are known to the language runtime, and they represent how the values are stored in
memory.
Note: Object type is the same as a string type.
As you can see, Customer ID, Gender, Age, Experience Score, and Household Size are
numbers, and the other items, such as income and loyalty group, are strings. Income is
a string because the column has a dollar sign ($).
Later, you see how you can group columns and aggregate them with others. If you want
to do any mathematical operation on the income column, you must first remove the
dollar sign from it and then convert it to an int64 data type.

5. Remove the $ sign from INCOME by applying a string replace method. This task
must be done to each element in the column by using a map function. Run the code.

Here, the map function applies the lambda function to each element of a Pandas series.
Let us look at the data frame after applying this map function in the following figure.
INCOME is still in a string format and has a "," in it. Remove it and convert the column
to integer type by using the following piece of code.
The int() function is used to convert strings to the int data type.
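A sketch of both conversion steps (the exact cleaning in your notebook may differ slightly):

# Strip the dollar sign from every element of the INCOME column.
customer_data['INCOME'] = customer_data['INCOME'].map(lambda x: x.replace('$', ''))

# Strip the thousands separator and convert each value to int.
customer_data['INCOME'] = customer_data['INCOME'].map(lambda x: int(x.replace(',', '')))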
6. Run the "dtypes" attribute again. It reveals that the data type conversion of
INCOME was successful.


7. Gather some statistics about the data columns by using the describe() function.
The output differs for string and numeric columns. Applying describe() to a string
column produces four statistics (count, unique, top, and freq), the most important of
which is the unique stat, which shows the number of different values in the string column.
For numeric data, the output of the describe() function includes count, mean, std, min,
max, and lower, 50, and upper percentiles. By default, the lower percentile is 25 and the
upper percentile is 75. The 50 percentile is the same as the median.
The Count stat, for both string and numeric types, shows the number of not-null values.
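For example (by default, describe() summarizes only the numeric columns, so apply it to a string column directly to get the string statistics; the column name is assumed from the data dictionary):

customer_data.describe()                   # count, mean, std, min, percentiles, max
customer_data['LOYALTY GROUP'].describe()  # count, unique, top, freq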

8. To know what unique values a "categorical" string column has, run the unique()
function on the column. As you can see in the following figure, the output is an array of
all the unique values in this column.
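For example:

customer_data['LOYALTY GROUP'].unique()  # array of the distinct values; 'enrolled' is one of them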

9. The Enrolment Date column is a date and should be represented as a datetime
object. You change the data type of Enrolment Date from object to a datetime object,
as you did with the Income column, but first you must import the datetime library.

datetime.strptime is a method that returns a datetime that corresponds to date_string,
which is parsed according to a specific format.
Running this conversion directly produces an error because there are null values in
the column, and null values cannot undergo the datetime conversion. The presence of
null values is not a data quality issue: some customers are not enrolled in the loyalty
program, so they have no enrolment dates.
To bypass this error, apply only the conversion function to the not-null values by
applying a filter on the column and then converting it. Column filters in Pandas can be
created by using the following syntax:
DataFrame[column][Column Filter Condition]
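A sketch of the conversion. The date format string is an assumption (match it to the actual file), and the .loc form is the idiomatic equivalent of the filter syntax above:

from datetime import datetime

# Filter that selects only the not-null enrolment dates.
has_date = customer_data['ENROLMENT DATE'].notnull()

# Convert only those values; the null values are left untouched.
customer_data.loc[has_date, 'ENROLMENT DATE'] = customer_data.loc[has_date, 'ENROLMENT DATE'].map(
    lambda d: datetime.strptime(d, '%m/%d/%Y'))  # assumed date format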
10. Now, check the data types for the whole Pandas data frame. Note that the
Enrolment Date column type has changed.
11. Check whether any of the data frames have null values, as shown in the following
figure.
12. As you can see, only the customer data has null values, so check where these
null values are by running the command that is shown in the following figure.
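A sketch of the two checks:

customer_data.isnull().values.any()  # True if the frame contains any null value
customer_data.isnull().sum()         # null count per column; only ENROLMENT DATE is non-zero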

It turns out that ENROLMENT DATE is the only column that has null values because
not all customers are enrolled in the loyalty program, so there is no enrolment date.
Hint: The data that is used in this tutorial is mostly free from data quality issues.
However, in a production environment, data scientists deal with data sets that have
quality issues that must be cleaned and corrected.

Analysis of the Distribution of Variables by using Graphs

So far, you have been presenting the data in a tabular format. In this section, you
present the data by using graphs. To do so, complete the following steps:

1. Import the necessary code libraries by scrolling down to the “Analysis of the
distribution of variables using graphs” heading in the notebook and run the library import
code that is shown in the following figure.
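The imports amount to the following (your notebook's list may differ):

import matplotlib.pyplot as plt
%matplotlib inline   # render plots inside the notebook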

2. Inspect the distribution of the values of some of the columns and the
relationships between two columns at a time.

Univariate Analysis (single variable analysis)


Analyze one variable at a time by plotting frequency distributions by completing the
following steps:
1. To know the relative proportion of married customers to those who are not
married, use the value_counts() method to return the counts of the different statuses
(categories). Then, use the plot function to draw a bar plot. Scroll down to the
"Univariate Analysis (Single variable analysis)" heading to see how this plot is created
in the code. More than half of Retailer X's customer base are married people, which
might affect their choice of products and their overall spending.
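In code, along these lines (the exact column name comes from the data dictionary):

customer_data['MARITAL STATUS'].value_counts().plot(kind='bar')
plt.show()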
2. Let us also look at the customers' age distribution by plotting a histogram that is
divided into 10 bins (intervals), as shown in the following figure.
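For example:

customer_data['AGE'].plot(kind='hist', bins=10)  # 10 bins (intervals)
plt.show()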

3. Another option is to draw a box plot for the age and show how this variable is
dispersed.
4. Compare the results that are shown in the box plot to the Age variable summary
statistics by using the describe command.
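For example:

customer_data['AGE'].plot(kind='box')  # median, quartiles, and outliers at a glance
plt.show()
customer_data['AGE'].describe()        # the same summary statistics in tabular form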

Creating a Customer view


There are two important objectives that were set during the Business Understanding
phase. To calculate the money that was spent buying Retailer X items, the transaction
data must be joined with the product data because the transaction data does not
indicate the unit price of the sold products; it just refers to the product number, not
the product type.
To join the two data sets, complete the following steps:
1. Create a Pandas frame that is called “trans_products” by joining
transactions_data and product_data. Use the merge command and specify the join
method as inner and specify the columns to join.
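A sketch (the join column name "Product Num" is an assumption; check the data dictionary):

trans_products = transactions_data.merge(product_data, how='inner', on='Product Num')
trans_products.head()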

2. You might need to change the type of the Unit List Price column, as you did with
the income in customer_data, because it is still in string format.
3. Check the data type by using the dtypes attribute.
4. Derive a total price column for each transaction by multiplying quantity by unit
price and subtracting any discounts taken.
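A sketch under assumed column names ("Quantity" and "Discount Taken" are hypothetical; check the data dictionary for the real names and the discount semantics):

# Clean the price string and convert it to a number, as was done for INCOME.
trans_products['Unit List Price'] = trans_products['Unit List Price'].map(
    lambda x: float(x.replace('$', '').replace(',', '')))

# Total price = quantity * unit price - discount taken.
trans_products['Total_Price'] = (trans_products['Quantity'] * trans_products['Unit List Price']
                                 - trans_products['Discount Taken'])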

5. You now have the total price of each transaction. If you sum up all the values,
you get the total revenue for Retailer X. This sum might be a good performance
indicator that you can report to Retailer X, but it is better if you can report the revenue
per product category. To do so, first group by “Product Category” column, then
aggregate by summing the total_price column. You then sort in descending order based
on the revenue. Use the “groupby”, “agg” and “sort_values” functions for this task.

6. Show this data as a pie chart and present it to Retailer X, but rename the
Total_Price column to a more meaningful name.
7. Then, draw the pie chart and display the percentage of revenue for each product
category.
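Along these lines:

revenue_per_category = revenue_per_category.rename(columns={'Total_Price': 'Revenue'})
revenue_per_category.plot(kind='pie', y='Revenue', autopct='%1.1f%%')  # percentage per category
plt.show()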

8. You might think that customer spending might affect loyalty program
participation, as in "the more a customer spends, the more likely they are to enroll in
the loyalty program". Let us calculate the following measures for each customer and
see whether any of these measures affects loyalty enrolment:
o Total spend per category
o Total spend
o Most recent transaction date
o Average discount taken

Total spends per category


Calculate the total spend per category by performing the following steps:
1. Run the following code.
The groupby output has an index or multi-index on its rows that corresponds to your
chosen grouping variables.
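The grouping is essentially the following (the frame name spend_per_category is illustrative):

spend_per_category = trans_products.groupby(['Customer Num', 'Product Category']).agg({'Total_Price': 'sum'})
spend_per_category.head()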

2. Use the columns attribute to list the columns of a Pandas data frame.

3. Note that only Total_Price appears as the single column in the data frame, and
that the other two columns, “Customer Num” and” Product category”, were not listed.
This occurs because when you group by some columns, they change from a column to
a multi- or a hierarchical index. To revert them back, use the reset_index() method.

4. Make it permanent by running the following code.
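For example:

spend_per_category.columns                             # only Total_Price is listed
spend_per_category = spend_per_category.reset_index()  # turn the index levels back into columns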

5. Let us reorganize the data and create a summary report that shows the spending
on each product category per customer. Use the "pivot" function to create a pivot table.
The pivot function needs three main arguments:
o The index, which sets the column to use as an index.
o The columns, which are the pivot columns to use.
o The values with which to populate the new frame.
NaN indicates that the customer did not buy any of this category. You later replace it
with a zero value.
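A sketch:

customer_pivot = spend_per_category.pivot(index='Customer Num',
                                          columns='Product Category',
                                          values='Total_Price')
customer_pivot.head()  # one row per customer, one column per category; NaN where nothing was bought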

Total spends and most recent transactions by date

Now, calculate the latest transaction date and the total spend for each customer. You
can calculate both in the same step because both measures are at the same level of
aggregation (which is per customer).

1. Convert the transaction date to a datetime object so that you can use an
aggregate function.

2. Calculate the measures by using the "groupby" and "agg" functions.
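A sketch of both steps (the "Transaction Date" column name and the date format are assumptions):

# Convert the transaction date strings to datetime objects.
trans_products['Transaction Date'] = trans_products['Transaction Date'].map(
    lambda d: datetime.strptime(d, '%m/%d/%Y'))

# Latest transaction date and total spend per customer in one pass.
recent_trans_total_spend = trans_products.groupby('Customer Num').agg(
    {'Transaction Date': 'max', 'Total_Price': 'sum'})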

3. Join the newly created Pandas data frame (recent_trans_total_spend) with the
customer pivot table to create a view for each customer that contains the following
information:
o Total spend
o Total spend per category
o Recent transaction date
4. Create the join on the common index, which is Customer NUM.

5. Replace the Null values with zeros by using the fillna function.

6. Augment this view with the original customer data set, by performing another join,
to provide Retailer X with a holistic view of its customers.
This is the customer view on which you are going to work.
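A sketch of steps 3 - 6 (the customer key column name and the rename to TOTAL SPENT, the name used later in the notebook, are assumptions):

# Join on the common index (Customer Num) and replace nulls with zeros.
customer_view = customer_pivot.join(recent_trans_total_spend).fillna(0)
customer_view = customer_view.rename(columns={'Total_Price': 'TOTAL SPENT'})

# Augment with the original customer attributes to build the holistic view.
customer_all_view = customer_data.merge(customer_view,
                                        left_on='CUSTOMER NUM',  # assumed key column name
                                        right_index=True)
customer_all_view.head()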

Bivariate Analysis (2-variable analysis): Loyalty as a target variable

Now that you have everything you want in one Pandas frame, conduct some bivariate
analysis to analyze the relationship between the different variables and loyalty
enrollment (the target variable) to test whether there is some association. For example,
if you find a significant relationship between age and loyalty, then age is likely a strong
predictor of customers enrolling in a loyalty program.
You use a Pandas crosstab to compute a simple cross-tabulation, which is a tool that
you can use to compare the relationship between two variables in tabular form,
especially categorical variables. You can present this cross tabulation easily in a
stacked bar chart.

Cross-tabulating gender with loyalty


Complete the following steps:

1. Start by cross-tabulating the Gender column with Loyalty by running the following
code.
2. The numbers that are given in the table represent the frequency counts for the
two factors together. You can plot this table in a bar chart by simply running a plot
command.
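A sketch of both steps (column names assumed):

gender_loyalty = pd.crosstab(customer_all_view['GENDER'], customer_all_view['LOYALTY GROUP'])
gender_loyalty.plot(kind='bar', stacked=True)
plt.show()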
Question: Does gender affect loyalty enrollment?

By visually inspecting the chart, being male or female apparently does not affect the
enrollment much. The ratio of enrolled to non-enrolled is about the same for both
genders (it looks like 1:1). Both genders are likely to join with the same probability.

Experience score

Let us see whether the experience score affects customer enrolment. Experience score
is the store experience survey result (1 = not satisfied with experience, 10 = highly
satisfied with experience). Run the following code.
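For example:

score_loyalty = pd.crosstab(customer_all_view['EXPERIENCE SCORE'], customer_all_view['LOYALTY GROUP'])
score_loyalty.plot(kind='bar', stacked=True)
plt.show()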
Question: Does the experience score affect loyalty enrollment?

It appears so, because customers with experience scores below 5 (1 - 4) did not enroll
at all. However, customers with scores of 5 or more are likely to enroll. So, knowing a
customer's experience score can predict the likelihood of that customer enrolling. If a
customer has a score below 5, then they are not likely to enroll at all. However, if
their experience score is 5 or more, they are likely to join with a probability of 60% -
70%.
Marital status
Do the same cross-tabulation for marital status by running the following code.
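For example:

marital_loyalty = pd.crosstab(customer_all_view['MARITAL STATUS'], customer_all_view['LOYALTY GROUP'])
marital_loyalty.plot(kind='bar', stacked=True)
plt.show()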

Question: Does marital status affect loyalty enrollment?


Apparently, marital status does not have much effect on loyalty enrollment. The ratio of
enrolled to non-enrolled appears to be almost the same for all marital statuses,
especially for married and singles (who are most of Retailer X’s customer base).

Age

Age is a continuous variable. To see the significance of age regarding loyalty
enrollment, bin the ages into several age groups, and for each group compute the
number of enrolled versus not-enrolled customers.
The Pandas cut command automatically bins values into discrete intervals. You use it
to derive a new Pandas column that is called "Age_Binned", which represents the age
group of the customer. In this exercise, you create 10 age groups (bins).
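A sketch:

customer_all_view['Age_Binned'] = pd.cut(customer_all_view['AGE'], bins=10)  # 10 equal-width groups
age_loyalty = pd.crosstab(customer_all_view['Age_Binned'], customer_all_view['LOYALTY GROUP'])
age_loyalty.plot(kind='bar', stacked=True)
plt.show()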

Question: Does age affect loyalty enrollment?

Based on the above chart, middle-aged customers are less likely to join loyalty
programs, and younger and elderly people are more likely to join. This piece of
information reveals that there is some sort of significant relationship between age and
loyalty enrollment. Let us see whether we can discover this relationship by completing
the following steps:

1. Using simple analysis, compare the means of the ages for both enrolled and not-
enrolled customers by calculating an average age for each group. However, this
analysis does not reveal much information because it compares only the two groups'
means. Based on the results, you can say that enrolled customers have a slightly higher
average age than their non-enrolled counterparts.
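For example:

customer_all_view.groupby('LOYALTY GROUP').agg({'AGE': 'mean'})  # average age per group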

2. This was a comparison of one summary statistic, which was the mean between
two groups of enrollments. To see the larger picture, draw the distribution of age for the
two groups side by side by using box plots.
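For example:

customer_all_view.boxplot(column='AGE', by='LOYALTY GROUP')  # age distribution per enrollment group
plt.show()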
The distribution of age for enrolled customers is more dispersed than that of non-
enrolled ones. Enrolled customers have a higher median and quartiles, and a bigger
IQR than their non-enrolled counterparts.
The chart shows that 75% of customers that are enrolled in the loyalty program are
below the age of 55, and 75% of non-enrolled customers are below the age of 43.

Total Spend
As total spend is a continuous variable, divide it into 10 discrete intervals and plot them
as a stacked bar chart.
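A sketch (the binned column name is illustrative):

customer_all_view['Spend_Binned'] = pd.cut(customer_all_view['TOTAL SPENT'], bins=10)
spend_loyalty = pd.crosstab(customer_all_view['Spend_Binned'], customer_all_view['LOYALTY GROUP'])
spend_loyalty.plot(kind='bar', stacked=True)
plt.show()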

Question: Does the total spending of customers affect their loyalty enrollment?
As an overall pattern, the graph shows that as the total spend of a customer increases,
so does their chance of enrollment.

Exercise: Investigating the relationship between loyalty enrollment and household size

You have tried to understand what factors might affect participation in the loyalty
program. Now, you will understand what factors affect the total spending of a customer.

Bivariate Analysis (2-variable analysis) – Customer spend as a target variable


Conduct some bivariate analysis to analyze the relationship between some variables
and customer spending (the target variable) to test whether there is an association.
Age
Age is a continuous variable, as is the total spend. Display them together in a scatter plot.
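For example:

customer_all_view.plot(kind='scatter', x='AGE', y='TOTAL SPENT')
plt.show()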

Question: Based on this graph, does age influence total spending?

To some degree, yes. If you look at the figure, you find that there is some positive
correlation between both variables, that is, the total spend of a customer increases as
their age increases.
The correlation between them can be quantified by a coefficient that is called the
Pearson correlation coefficient. This correlation coefficient is a statistical measure that
calculates the strength of the relationship between the relative movements of the two
numeric variables. The correlation coefficient ranges from -1.0 to 1.0. A correlation of
-1.0 shows a perfect negative correlation, and a correlation of 1.0 shows a perfect
positive correlation. A correlation of 0.0 shows no relationship between the movements
of the two variables.
Use the "pearsonr" function from SciPy to calculate the Pearson correlation coefficient
without considering the underlying math.
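For example:

from scipy.stats import pearsonr
corr, p_value = pearsonr(customer_all_view['AGE'], customer_all_view['TOTAL SPENT'])
corr  # approximately 0.576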
The correlation coefficient is 0.576, which implies a moderately strong correlation between
both factors.

Income
Income is a continuous variable. Create a scatter plot to show the relationship between
income and total spend, and then calculate the Pearson correlation coefficient.
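For example:

customer_all_view.plot(kind='scatter', x='INCOME', y='TOTAL SPENT')
plt.show()
pearsonr(customer_all_view['INCOME'], customer_all_view['TOTAL SPENT'])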

Income has a higher correlation coefficient than age, which indicates that it has a strong
relationship with customer spending. This finding is logical: Anyone would expect that
customer spending is heavily dependent on their income.

Experience score
Examine the relationship between the experience score (which is a categorical feature)
and total spend (which is continuous). The experience score has 10 distinct scores with
a range of 1 - 10. For each score (category), calculate the mean customer total spend
by running a “groupby”, and then an agg command. Plot the results in a bar chart.
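Along these lines:

avg_spend_by_score = customer_all_view.groupby('EXPERIENCE SCORE').agg({'TOTAL SPENT': 'mean'})
avg_spend_by_score.plot(kind='bar')
plt.show()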

Obviously, customers with experience scores 1 - 4 have a relatively lower average spend
than customers with higher experience scores (5 - 10). This indicates some sort of
relationship between the two variables.

Writing code for clustering

The comment "#Begin Writing your code here" below the "Clustering" heading marks
the location where you start, as shown in Figure 1.
To apply the K-means and hierarchical agglomerative algorithms to cluster data,
complete the following steps:
1. Import the necessary libraries by using the following code:
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
2. Select the features on which you are clustering. In this example, we cluster
“income” and “Total spent” variables by using the following code:
cluster_input=customer_all_view[['INCOME','TOTAL SPENT']]
cluster_input.head(5)
The “cluster_input” variable is a Pandas data frame that contains only the columns
“income” and “total spent”. We use these two continuous variables because of the
following reasons:
o Two variables can be easily visualized on a 2-dimensional plot
o Clustering algorithms rely on a distance function (like Euclidean distance)
to compute similarity among data points. The sample space for categorical data is
discrete and doesn't have a natural origin, so a Euclidean distance function on such a
space isn't really meaningful.
3. Initialize a K-means model with four clusters as follows:
Kmeans_model=KMeans(n_clusters=4)
Although you can use the elbow method or silhouette to determine the optimal number
of clusters, you and Retailer X agreed to divide their customer base into only four
clusters.
4. Look at the parameters of the model by running the following code:
Kmeans_model
Here is the output:

The output shows the parameters that were passed to the model, which decide how the
algorithm works. Because we passed only the number of clusters (k) parameter, every
other parameter used its default value.
The K-means algorithm follows an iterative way of clustering data points. The default
number of iterations is 300, which is visible in the Out[23] output. This parameter
defines the maximum iteration limit because the algorithm can reach convergence
before reaching this maximum.
For more information about the parameters, see
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
5. Run the K-means cluster algorithm on the input by using the "fit_predict" method:
cluster_output = Kmeans_model.fit_predict(cluster_input)
6. Get the output of the K-means algorithm by using the following code:
cluster_output
Here is the output:

The numbers in the output array represent the cluster index for each sample. Indexes
are in the range of [0,1,2,3] because you specified four clusters. Samples with the same
index belong to the same cluster.
7. The output of step 6 is in a NumPy array type. Run the following command to
confirm that the output is in NumPy format and not Pandas.
type(cluster_output)
Here is the output:

Recall that we used the Pandas data frame structure to store the tabular data that is
in the CSV files (products, customers, and transactions). Let us look at the first few
records of products by using the following code:
product_data.head()
Here is the output:

If you run the values attribute on a Pandas data frame, it converts the frame to a
NumPy array. Run it on the product data frame and observe the output.
product_data.head().values
Here is the output:

The output clearly indicates how Pandas and NumPy are closely related. A Pandas data
frame is an arrangement of data in rows and columns like a normal table. NumPy
provides a multidimensional array object, which here holds a 2-dimensional array.
NumPy stores data in a row and column format, as shown in Figure 2.

Figure 2. The rows are indicated as "axis 0", while the columns are "axis 1".
To access particular elements in that array, you use square brackets ([]) to index array
values.
For example, if you need to know the value represented by row 1 (the second row) and
column 2 (the third column), you should use the index [1,2].
To access the element in the second row and third column, use the following code:
product_data.head().values[1,2]
Here is the output

To view all elements in the second row, use ":" for the column index to select all
columns:
product_data.head().values[1,:]
Here is the output:

Values in the output are presented in a 1-dimensional array because we called only a
single row of data.
To view the values of the third column, use the following code:
product_data.head().values[:,2]
Here is the output

Values in the output are presented in a 1-dimensional array because we called only a
single column of data.
You can convert the 1-dimensional NumPy array to a Pandas data frame by using the
following code:
cluster_output_pd=pd.DataFrame(cluster_output,columns=['segment'])
The “cluster_output” is a 1-dimensional array because a single cluster index is assigned
to every customer record.
Verify that “cluster_output_pd” is a Pandas data frame with a single column called
“segment” by using the following code:
cluster_output_pd.head()

8. Merge the cluster input containing the income and total spending for each
customer and the cluster output, which contains the cluster index, by using the following
code:
segment_DF=pd.concat([cluster_input,cluster_output_pd],axis=1)
9. Verify the output by using the following command:
segment_DF.head()
10. The cluster centroids that are computed by the algorithm can be found in an
attribute that is called "cluster_centers_". Apply it to the K-means model by using the
following code:
Kmeans_model.cluster_centers_
Here is the output:

As shown in the Out[83] output, the first centroid is at income = 39991 and total
spend = 2424.8. In step 13, you present the clusters and their centroids as a
2-dimensional scatter plot to visualize the shape of the clusters.
11. The "segment_DF" Pandas data frame that is created in step 8 contains points
that belong to all customer segments. To select only those points that belong to the
first cluster (cluster index = 0), use the following code:
segment_DF[segment_DF.segment==0].head()
The condition "segment==0" selects only the points assigned to the first cluster.
Similarly, you can do the same for the other clusters.
12. Plot the clustering results. First, import the plotting library by using the following
code:
import matplotlib.pyplot as plt
13. Use the filter condition that was applied in step 11 to select only those customers
that belong to the first cluster. Plot their income and total spend in purple. Select the
second cluster and plot their income and total spend in blue. Repeat for clusters 3 and 4.
Then, plot the cluster centroids that you computed in step 10.
The following code snippet contains many lines of code. The first four lines plot the
points that belong to each cluster (0, 1, 2, 3), the fifth line plots the cluster centroids,
and the remaining lines set the title, legend, and the axis names for the graph. Copy
the code, run it, and observe the scattered cluster plot, which is shown in Figure 3.
plt.scatter(segment_DF[segment_DF.segment==0]['INCOME'],
            segment_DF[segment_DF.segment==0]['TOTAL SPENT'],
            s=50, c='purple', label='Cluster1')
plt.scatter(segment_DF[segment_DF.segment==1]['INCOME'],
            segment_DF[segment_DF.segment==1]['TOTAL SPENT'],
            s=50, c='blue', label='Cluster3')
plt.scatter(segment_DF[segment_DF.segment==2]['INCOME'],
            segment_DF[segment_DF.segment==2]['TOTAL SPENT'],
            s=50, c='green', label='Cluster4')
plt.scatter(segment_DF[segment_DF.segment==3]['INCOME'],
            segment_DF[segment_DF.segment==3]['TOTAL SPENT'],
            s=50, c='cyan', label='Cluster2')
plt.scatter(Kmeans_model.cluster_centers_[:,0],
            Kmeans_model.cluster_centers_[:,1],
            s=200, marker='s', c='red', alpha=0.7, label='Centroids')
plt.title('Customer segments using K-means (k=4)')
plt.xlabel('Income')
plt.ylabel('Total Spend')
plt.legend()
plt.show()

Figure 3: Notice the position of the cluster centroids: They lie in the middle of each
cluster.
14. Using the graph that was produced in step 13, create a table to describe the four
customer segments relative to their income and spend.
The following table illustrates the relative income and spending for the four different
customer segments of Retailer X.
15. Retailer X must know all about the demographics of the different customer
segments. So, you must discover the characteristics that are associated with each
segment, such as the age group, household size, and loyalty enrolment. To do this task,
group by each customer segment and calculate group measures such as average age,
percentage of loyalty enrolment, and median household size.
Merge the clustering output with the customer all view by using the following code:
customer_demographics=pd.concat([customer_all_view,cluster_output_pd],axis=1)
customer_demographics.head()
Here is the output:

Regarding the "Age" and "Household Size" aggregations, the calculation is
straightforward and can be carried out as was done in the previous course, by using
group by and simple aggregate functions:
customer_demographics.groupby('segment').agg({'AGE':'mean','HOUSEHOLD SIZE':'median'})
Here is the output:
This output shows the mean age and median household size for each cluster. Notice
how age varies significantly across segments.
With regards to loyalty enrolment, you must calculate the percentage of participation
by using the following formula:
percentage of enrolment = 100 * (number of enrolled customers / total number of customers)
So, you create a function by the name "percent_loyalty" by using the following code:
def percent_loyalty(series):
    percent = 100 * series.value_counts()['enrolled'] / series.count()
    return percent
This function accepts a Pandas series as input; here we pass the "LOYALTY GROUP"
column (one Pandas data frame column is a Pandas series). The function returns the
percentage of enrolment by dividing the number of enrolled customers (obtained by
using the value_counts method) by the total number of customers (obtained by using
the count function).
Pass the created function as an aggregate measure in the "agg" command as follows:
customer_demographics.groupby('segment').agg({'AGE':'mean','HOUSEHOLD SIZE':'median',
                                              'LOYALTY GROUP': percent_loyalty})
Here is the output:

Extend the tabular report that you constructed in step 14 to include the segment
demographic data that you produced.
This table shows the demographic segmentation for Retailer X customer segments.
Save Notebook
While your notebook is open, click File, then Download as, and select Notebook
(.ipynb) to save it on your local machine, as shown in Figure 4. This is a precaution in
case you accidentally delete the notebook or the whole project.

Figure 4
Exercise
As was done with the K-means cluster model (in "Writing code for clustering"), use
the agglomerative hierarchical clustering method to cluster the data and provide
demographic segmentation to Retailer X. Use the same steps from that section with
the following exceptions:
1. Use the AgglomerativeClustering library that was imported in step 1 to cluster the
data instead of the KMeans library by using the following code:
AgglomerativeClustering_model=AgglomerativeClustering(n_clusters=4)
2. There are no centroids defined for this type of clustering.
The rest of the steps remain the same.
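A minimal sketch of how the substitution plays out (the remaining steps mirror the K-means flow):

AgglomerativeClustering_model = AgglomerativeClustering(n_clusters=4)
cluster_output = AgglomerativeClustering_model.fit_predict(cluster_input)  # same input frame as before
# Skip the centroid plotting line; agglomerative clustering has no cluster_centers_ attribute.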
Conclusion
You used the K-means and hierarchical clustering algorithms to cluster customer data
based on income and total spend. The resulting clusters were evaluated by using data
that was not used for clustering, such as age, household size, and loyalty enrollment.
