
Task 2: Exploratory Data Analysis

Background information
The BCG project team thinks that building a churn model to understand whether price
sensitivity is the largest driver of churn has potential. The client has sent over some data and
the AD wants you to perform some exploratory data analysis.

The data that was sent over includes:

- Historical customer data: customer data such as usage, sign-up date, forecasted usage, etc.
- Historical pricing data: variable and fixed pricing data, etc.
- Churn indicator: whether each customer has churned or not

Task
Sub-Task 1:
Perform some exploratory data analysis. Look into the data types, data
statistics, specific parameters, and variable distributions. This first subtask is
for you to gain a holistic understanding of the dataset. You should spend
around 1 hour on this.

Sub-Task 2:
Verify the hypothesis that price sensitivity is, to some extent, correlated with
churn. It is up to you to define price sensitivity and calculate it. You should
spend around 30 minutes on this.
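
One possible starting point, sketched below, is to treat the forecast price columns in the client dataset (loaded in the Sub-Task 1 walkthrough that follows) as a rough proxy for the prices each customer faces and check how they correlate with the churn flag. This is only one of several reasonable ways to define price sensitivity.

# rough proxy for price sensitivity: correlation between forecast price features and churn
# (assumes the `client` DataFrame loaded in Sub-Task 1 below)
price_cols = ['forecast_price_energy_off_peak',
              'forecast_price_energy_peak',
              'forecast_price_pow_off_peak']
print(client[price_cols + ['churn']].corr()['churn'].drop('churn'))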

Sub-Task 3:
Prepare a half-page summary or slide of key findings and add some suggestions
for data augmentation – which other sources of data should the client provide
you with and which open source datasets might be useful? You should spend
10-15 minutes on this.
Sub-Task 1:

# load packages
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

Client dataset feature descriptions: (feature description table not reproduced here)


# load client dataset
client = pd.read_csv('../input/bcginternshipprogram/client_data.csv')
client.head(3)

OUTPUT: (first three rows of the client dataset; table not reproduced here)

Have a look at the general information on the client dataset


client.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14606 entries, 0 to 14605
Data columns (total 26 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              14606 non-null  object
 1   channel_sales                   14606 non-null  object
 2   cons_12m                        14606 non-null  int64
 3   cons_gas_12m                    14606 non-null  int64
 4   cons_last_month                 14606 non-null  int64
 5   date_activ                      14606 non-null  object
 6   date_end                        14606 non-null  object
 7   date_modif_prod                 14606 non-null  object
 8   date_renewal                    14606 non-null  object
 9   forecast_cons_12m               14606 non-null  float64
 10  forecast_cons_year              14606 non-null  int64
 11  forecast_discount_energy        14606 non-null  float64
 12  forecast_meter_rent_12m         14606 non-null  float64
 13  forecast_price_energy_off_peak  14606 non-null  float64
 14  forecast_price_energy_peak      14606 non-null  float64
 15  forecast_price_pow_off_peak     14606 non-null  float64
 16  has_gas                         14606 non-null  object
 17  imp_cons                        14606 non-null  float64
 18  margin_gross_pow_ele            14606 non-null  float64
 19  margin_net_pow_ele              14606 non-null  float64
 20  nb_prod_act                     14606 non-null  int64
 21  net_margin                      14606 non-null  float64
 22  num_years_antig                 14606 non-null  int64
 23  origin_up                       14606 non-null  object
 24  pow_max                         14606 non-null  float64
 25  churn                           14606 non-null  int64
dtypes: float64(11), int64(7), object(8)
memory usage: 2.9+ MB
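
Eight columns are of object dtype. Four of them are dates (handled next); the remaining ones (id, channel_sales, has_gas, origin_up) appear to be identifiers or categorical codes. The LabelEncoder imported above could be used to turn the categorical ones into integers before modelling; a minimal sketch, shown here for illustration only and not applied to the DataFrame at this stage:

# illustration only: integer-encode one categorical column with LabelEncoder
le = LabelEncoder()
channel_sales_encoded = le.fit_transform(client['channel_sales'])
print(dict(zip(le.classes_, range(len(le.classes_)))))  # category -> integer code mapping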

There are four date-related features stored as the object data type; it is better to convert them to the datetime data type.
# convert the date features to the datetime data type
for f in ['date_activ', 'date_end', 'date_modif_prod', 'date_renewal']:
    client[f] = pd.to_datetime(client[f])

Add some new features derived from the datetime features above.

# extract the contract start and end years as new features
client['contract_start_year'] = client['date_activ'].dt.year
client['contract_end_year'] = client['date_end'].dt.year
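
Other features could be derived from the same dates. As a hypothetical example (computed here for illustration only, not added as a column), the length of each contract in years:

# hypothetical illustration: contract length in years, not added to the DataFrame
contract_length_years = (client['date_end'] - client['date_activ']).dt.days / 365.25
print(contract_length_years.describe())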

Next, check for missing values and duplicate records.

# define a function to display missing values and duplicate rows
def duplicate_and_missing(dataset, dataset_name):
    print('There are', dataset.shape[0], 'rows and', dataset.shape[1],
          'columns in the dataset', '"' + dataset_name + '"', '\n' + '--'*40)
    # display missing values
    if dataset.isna().sum().sum() != 0:  # if there are missing values
        missing_value = dataset.isna().sum()[dataset.isna().sum() != 0].to_frame(name='count')
        missing_value['proportion'] = missing_value['count'] / len(dataset)
        print('There are', dataset.isna().sum().sum(), 'missing values')
        print(missing_value, '\n' + '--'*40)
    else:
        print('There is no missing value')
    # display duplicate rows
    if dataset.duplicated().sum() != 0:
        print('There are', dataset.duplicated().sum(), 'duplicate rows\n')
    else:
        print('There is no duplicate row\n')

There are no missing values or duplicate rows.


duplicate_and_missing(dataset=client, dataset_name='Client')
There are 14606 rows and 28 columns in the dataset "Client"
--------------------------------------------------------------------------------
There is no missing value
There is no duplicate row
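
The brief also asks for data statistics and variable distributions. A minimal sketch of how that part of Sub-Task 1 could continue, using the libraries imported above (the plotted columns are illustrative choices; sns.histplot assumes seaborn 0.11 or newer):

# summary statistics for the numeric features
print(client.describe())

# churn rate: proportion of customers who have churned
print(client['churn'].value_counts(normalize=True))

# illustrative distribution plots (column choices are examples only)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(client['cons_12m'], bins=50, ax=axes[0])
axes[0].set_title('cons_12m distribution')
sns.countplot(x='churn', data=client, ax=axes[1])
axes[1].set_title('Churn indicator')
plt.tight_layout()
plt.show()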
