BCG Internship Task 2
BCG Internship Task 2
Data Wrangling
In [1]: ## Importing neccesary packages into the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]: ## Importing the CSV from file directory into the notebook
cust_data = pd.read_csv(r"C:\Users\yomol\Downloads\ml_case_training_data.csv")
cust_data.head()
Out[3]:
id activity_new campaign_disc_ele
5 rows × 32 columns
cust_data.duplicated("id")
Out[4]: 0 False
1 False
2 False
3 False
4 False
...
16091 False
16092 False
16093 False
16094 False
16095 False
Length: 16096, dtype: bool
price_data.head()
Out[5]:
id price_date price_p1_var price_p2_var price_p3_var pric
2015-01-
0 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01
2015-02-
1 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01
2015-03-
2 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01
2015-04-
3 038af19179925da21a25619c5a24b745 0.149626 0.0 0.0 44
01
2015-05-
4 038af19179925da21a25619c5a24b745 0.149626 0.0 0.0 44
01
price_data.duplicated("id")
Out[6]: 0 False
1 True
2 True
3 True
4 True
...
192997 True
192998 True
192999 True
193000 True
193001 True
Length: 193002, dtype: bool
price_data_new
Out[7]:
id price_date price_p1_var price_p2_var price_p3_var
2015-01-
0 038af19179925da21a25619c5a24b745 0.151367 0.000000 0.000000
01
2015-01-
12 31f2ce549924679a3cbb2d128ae9ea43 0.125976 0.103395 0.071536
01
2015-01-
24 36b6352b4656216bfdb96f01e9a94b4e 0.123086 0.100505 0.068646
01
2015-01-
36 48f3e6e86f7a8656b2c6b6ce2763055e 0.144431 0.000000 0.000000
01
2015-01-
48 cce88c7d721430d8bd31f71ae686c91e 0.153159 0.130578 0.098720
01
2015-01-
192942 cd622263c26436d1237e94ff05cdd506 0.151367 0.000000 0.000000
01
2015-01-
192954 ed3434c3c1e2056d1a313e2671815e4d 0.128069 0.105843 0.073773
01
2015-01-
192966 d00da2c0c568614b9937791f681cd7d7 0.150211 0.000000 0.000000
01
2015-01-
192978 045f94f0b7f538a8d8fae11080abb5da 0.151367 0.000000 0.000000
01
2015-01-
192990 16f51cdc2baa19af0b940ee1b3dd17d5 0.129444 0.106863 0.075004
01
price_data_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 192990
Data columns (total 8 columns):
id 16096 non-null object
price_date 16096 non-null object
price_p1_var 16060 non-null float64
price_p2_var 16060 non-null float64
price_p3_var 16060 non-null float64
price_p1_fix 16060 non-null float64
price_p2_fix 16060 non-null float64
price_p3_fix 16060 non-null float64
dtypes: float64(6), object(2)
memory usage: 1.1+ MB
We can see missing values on the prices column beacuse the length of each of the columns was short
by 36. We will deal with missing values later.
Data Transformation
In [9]: # To check if the id columns in the new pricing data is same with the customer
data preparing for merging the two data.
check_data_columns = price_data_new.id.isin(cust_data.id).astype(int)
check_data_columns
Out[9]: 0 1
12 1
24 1
36 1
48 1
..
192942 1
192954 1
192966 1
192978 1
192990 1
Name: id, Length: 16096, dtype: int32
check_data_columns.value_counts()
Out[10]: 1 16096
Name: id, dtype: int64
The length of the check is similar to the id length in both new pricing data and customer data. so we can
merge them.
Out[11]:
id churn
0 48ada52261e7cf58715202705a0451c9 0
1 24011ae4ebbe3035111d65fa7c15bc57 1
2 d29c2c54acc38ff3c0614d0a653813dd 0
3 764c75f661154dac3a6c254cd082ea7d 0
4 bba03439a292a1e166f80264c16191cb 0
churn_data.duplicated("id", keep=False)
Out[12]: 0 False
1 False
2 False
3 False
4 False
...
16091 False
16092 False
16093 False
16094 False
16095 False
Length: 16096, dtype: bool
In [13]: # check if the id column of the churn data is same as the id column of the cus
tomer data.
check_data_columns2 = churn_data.id.isin(cust_data.id).astype(int)
check_data_columns2
Out[13]: 0 1
1 1
2 1
3 1
4 1
..
16091 1
16092 1
16093 1
16094 1
16095 1
Name: id, Length: 16096, dtype: int32
In [14]: check_data_columns2.value_counts()
Out[14]: 1 16096
Name: id, dtype: int64
Churn data is similar in length of id column to customer data. We can merge both dataframe to a single
data frame.
In [15]: ## Merging the churn data and the customer data for data wrangling.
Out[15]:
id activity_new campaign_disc_ele
Out[16]:
id activity_new campaign_disc_ele
powerco_data_merge.head()
Out[17]:
id activity_new campaign_disc_ele
5 rows × 40 columns
powerco_data_merge.describe(include="all")
Out[18]:
id activity_new campaign_disc_ele
11 rows × 40 columns
In [19]: ## Understanding the collumn data types and missing values in the collumn
powerco_data_merge.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 40 columns):
id 16096 non-null object
activity_new 6551 non-null object
campaign_disc_ele 0 non-null float64
channel_sales 11878 non-null object
cons_12m 16096 non-null int64
cons_gas_12m 16096 non-null int64
cons_last_month 16096 non-null int64
date_activ 16096 non-null object
date_end 16094 non-null object
date_first_activ 3508 non-null object
date_modif_prod 15939 non-null object
date_renewal 16056 non-null object
forecast_base_bill_ele 3508 non-null float64
forecast_base_bill_year 3508 non-null float64
forecast_bill_12m 3508 non-null float64
forecast_cons 3508 non-null float64
forecast_cons_12m 16096 non-null float64
forecast_cons_year 16096 non-null int64
forecast_discount_energy 15970 non-null float64
forecast_meter_rent_12m 16096 non-null float64
forecast_price_energy_p1 15970 non-null float64
forecast_price_energy_p2 15970 non-null float64
forecast_price_pow_p1 15970 non-null float64
has_gas 16096 non-null object
imp_cons 16096 non-null float64
margin_gross_pow_ele 16083 non-null float64
margin_net_pow_ele 16083 non-null float64
nb_prod_act 16096 non-null int64
net_margin 16081 non-null float64
num_years_antig 16096 non-null int64
origin_up 16009 non-null object
pow_max 16093 non-null float64
churn 16096 non-null int64
price_date 16096 non-null object
price_p1_var 16060 non-null float64
price_p2_var 16060 non-null float64
price_p3_var 16060 non-null float64
price_p1_fix 16060 non-null float64
price_p2_fix 16060 non-null float64
price_p3_fix 16060 non-null float64
dtypes: float64(22), int64(7), object(11)
memory usage: 5.0+ MB
Columns like:
activity_new,channel_sales,date_first_activ,date_modif_prod,forecast_base_bill_ele,forecast_base_bill_year,foreca
have very noticeable missing values. We have to investigate the reason for missing values and if we need to let
the column.</span>
check_activity_new = powerco_data_merge[["activity_new"]]
check_activity_new
Out[20]:
activity_new
0 esoiiifxdlbkcsluxmfuacbdckommixw
1 NaN
2 NaN
3 NaN
4 NaN
... ...
16091 NaN
16092 NaN
16093 NaN
16094 NaN
16095 NaN
In [21]: check_activity_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 1 columns):
activity_new 6551 non-null object
dtypes: object(1)
memory usage: 251.5+ KB
activity_new is the category of the companys activity. Since there data in the column is encrypted, and
there is no way to know the meaning of the data and so we cannot perform a transformation and operate
on the column. We consider deleting the column. Same with the campaign_disc_ele whose elements are
NaN
Out[22]:
campaign_disc_ele
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
16091 NaN
16092 NaN
16093 NaN
16094 NaN
16095 NaN
In [23]: check_campaign_disc_ele.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 1 columns):
campaign_disc_ele 0 non-null float64
dtypes: float64(1)
memory usage: 251.5 KB
del powerco_data_merge["activity_new"]
Checking the data to see the exit of the two irrelevant columns due to excessive missing values.
In [26]: powerco_data_merge
Out[26]:
id channel_sales cons_12m cons_ga
1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0
In [27]: # Column Channel sales( code of the channel sales) also have missing values
channel_sales=powerco_data_merge[["channel_sales"]]
In [28]: channel_sales.isnull()
Out[28]:
channel_sales
0 False
1 False
2 True
3 False
4 False
... ...
16091 False
16092 False
16093 False
16094 False
16095 True
we have 4218 number of missing values of the channel_sales . We will attempt to replace the missing
values by forward fill.
In [29]: # We use the fillna method to fill forward without limit so we can have all pl
aces in the column fixed.
channel_sales1 = powerco_data_merge[["channel_sales"]].fillna(method="pad")
channel_sales1
Out[29]:
channel_sales
0 lmkebamcaaclubfxadlmueccxoimlema
1 foosdfpfkusacimwkcsosbicdxkicaua
2 foosdfpfkusacimwkcsosbicdxkicaua
3 foosdfpfkusacimwkcsosbicdxkicaua
4 lmkebamcaaclubfxadlmueccxoimlema
... ...
16091 foosdfpfkusacimwkcsosbicdxkicaua
16092 foosdfpfkusacimwkcsosbicdxkicaua
16093 foosdfpfkusacimwkcsosbicdxkicaua
16094 foosdfpfkusacimwkcsosbicdxkicaua
16095 foosdfpfkusacimwkcsosbicdxkicaua
In [30]: powerco_data_merge["channel_sales"]=channel_sales1
powerco_data_merge
Out[30]:
id channel_sales cons_12m cons_ga
1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0
Out[31]: id 0
channel_sales 0
cons_12m 0
cons_gas_12m 0
cons_last_month 0
date_activ 0
date_end 2
date_first_activ 12588
date_modif_prod 157
date_renewal 40
forecast_base_bill_ele 12588
forecast_base_bill_year 12588
forecast_bill_12m 12588
forecast_cons 12588
forecast_cons_12m 0
forecast_cons_year 0
forecast_discount_energy 126
forecast_meter_rent_12m 0
forecast_price_energy_p1 126
forecast_price_energy_p2 126
forecast_price_pow_p1 126
has_gas 0
imp_cons 0
margin_gross_pow_ele 13
margin_net_pow_ele 13
nb_prod_act 0
net_margin 15
num_years_antig 0
origin_up 87
pow_max 3
churn 0
price_date 0
price_p1_var 36
price_p2_var 36
price_p3_var 36
price_p1_fix 36
price_p2_fix 36
price_p3_fix 36
dtype: int64
In [32]: # lets replace the all the numeric series with their mean
mean_price_p1_var = powerco_data_merge["price_p1_var"].sum()/16096
powerco_data_merge["price_p1_var"]=powerco_data_merge["price_p1_var"].fillna(m
ean_price_p1_var)
mean_price_p2_var = powerco_data_merge["price_p2_var"].sum()/16096
powerco_data_merge["price_p2_var"]=powerco_data_merge["price_p2_var"].fillna(m
ean_price_p2_var)
mean_price_p3_var = powerco_data_merge["price_p3_var"].sum()/16096
powerco_data_merge["price_p3_var"]=powerco_data_merge["price_p3_var"].fillna(m
ean_price_p3_var)
# p1_fix
mean_price_p1_fix = powerco_data_merge["price_p1_fix"].sum()/16096
powerco_data_merge["price_p1_fix"]=powerco_data_merge["price_p1_fix"].fillna(m
ean_price_p1_fix )
# p2_fix
mean_price_p2_fix = powerco_data_merge["price_p2_fix"].sum()/16096
powerco_data_merge["price_p2_fix"]=powerco_data_merge["price_p2_fix"].fillna(m
ean_price_p2_fix )
mean_price_p3_fix = powerco_data_merge["price_p3_fix"].sum()/16096
powerco_data_merge["price_p3_fix"]=powerco_data_merge["price_p3_fix"].fillna(m
ean_price_p3_fix)
# price_p3_fix
mean_price_p3_fix = powerco_data_merge["price_p3_fix"].sum()/16096
powerco_data_merge["price_p3_fix"]=powerco_data_merge["price_p3_fix"].fillna(m
ean_price_p3_fix)
# margin_gross_pow_ele
mean_margin_gross_pow_ele = powerco_data_merge["margin_gross_pow_ele"].sum()/
16096
powerco_data_merge["margin_gross_pow_ele"]=powerco_data_merge["margin_gross_po
w_ele"].fillna(mean_margin_gross_pow_ele)
# margin_net_pow_ele
mean_margin_net_pow_ele = powerco_data_merge["margin_net_pow_ele"].sum()/160
96
powerco_data_merge["margin_net_pow_ele"]=powerco_data_merge["margin_net_pow_el
e"].fillna(mean_margin_net_pow_ele)
# net_margin
mean_net_margin = powerco_data_merge["net_margin"].sum()/16096
powerco_data_merge["net_margin"]=powerco_data_merge["net_margin"].fillna(mean_
net_margin)
# pow_max
mean_pow_max = powerco_data_merge["pow_max"].sum()/16096
powerco_data_merge["pow_max"]=powerco_data_merge["pow_max"].fillna(mean_pow_ma
x)
# origin_up
powerco_data_merge["origin_up"]= powerco_data_merge["origin_up"].fillna(method
="pad")
# forecast_base_bill_ele
powerco_data_merge["forecast_base_bill_ele"]= powerco_data_merge["forecast_bas
e_bill_ele"].fillna(value=0)
# forecast_base_bill_year
powerco_data_merge["forecast_base_bill_year"]= powerco_data_merge["forecast_ba
se_bill_year"].fillna(value=0)
# forecast_bill_12m
powerco_data_merge["forecast_bill_12m"]= powerco_data_merge["forecast_bill_12
m"].fillna(value=0)
#forecast_cons
powerco_data_merge["forecast_cons"]= powerco_data_merge["forecast_cons"].filln
a(value=0)
# forecast_discount_energy
mean_forecast_discount_energy = powerco_data_merge["forecast_discount_energ
y"].sum()/16096
powerco_data_merge["forecast_discount_energy"]= powerco_data_merge["forecast_d
iscount_energy"].fillna(mean_forecast_discount_energy )
# forecast_price_energy_p1
mean_forecast_price_energy_p1 = powerco_data_merge["forecast_price_energy_p
1"].sum()/16096
powerco_data_merge["forecast_price_energy_p1"]= powerco_data_merge["forecast_p
rice_energy_p1"].fillna(mean_forecast_price_energy_p1)
# forecast_price_energy_p2
mean_forecast_price_energy_p2 = powerco_data_merge["forecast_price_energy_
p2"].sum()/16096
powerco_data_merge["forecast_price_energy_p2"]= powerco_data_merge["forecast_p
rice_energy_p2"].fillna(mean_forecast_price_energy_p2)
# forecast_price_pow_p1
mean_forecast_price_pow_p1 = powerco_data_merge["forecast_price_pow_p1"].
sum()/16096
powerco_data_merge["forecast_price_pow_p1"]= powerco_data_merge["forecast_pric
e_pow_p1"].fillna(mean_forecast_price_energy_p2)
powerco_data_merge.isnull().sum()
powerco_data_merge.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 38 columns):
id 16096 non-null object
channel_sales 16096 non-null object
cons_12m 16096 non-null int64
cons_gas_12m 16096 non-null int64
cons_last_month 16096 non-null int64
date_activ 16096 non-null object
date_end 16094 non-null object
date_first_activ 3508 non-null object
date_modif_prod 15939 non-null object
date_renewal 16056 non-null object
forecast_base_bill_ele 16096 non-null float64
forecast_base_bill_year 16096 non-null float64
forecast_bill_12m 16096 non-null float64
forecast_cons 16096 non-null float64
forecast_cons_12m 16096 non-null float64
forecast_cons_year 16096 non-null int64
forecast_discount_energy 16096 non-null float64
forecast_meter_rent_12m 16096 non-null float64
forecast_price_energy_p1 16096 non-null float64
forecast_price_energy_p2 16096 non-null float64
forecast_price_pow_p1 16096 non-null float64
has_gas 16096 non-null object
imp_cons 16096 non-null float64
margin_gross_pow_ele 16096 non-null float64
margin_net_pow_ele 16096 non-null float64
nb_prod_act 16096 non-null int64
net_margin 16096 non-null float64
num_years_antig 16096 non-null int64
origin_up 16096 non-null object
pow_max 16096 non-null float64
churn 16096 non-null int64
price_date 16096 non-null object
price_p1_var 16096 non-null float64
price_p2_var 16096 non-null float64
price_p3_var 16096 non-null float64
price_p1_fix 16096 non-null float64
price_p2_fix 16096 non-null float64
price_p3_fix 16096 non-null float64
dtypes: float64(21), int64(7), object(10)
memory usage: 4.8+ MB
powerco_data_merge["has_gas"].value_counts()
Out[33]: f 13132
t 2964
Name: has_gas, dtype: int64
In [35]: powerco_data_merge["has_gas"]
Out[35]: 0 1
1 0
2 1
3 1
4 1
..
16091 0
16092 1
16093 1
16094 1
16095 1
Name: has_gas, Length: 16096, dtype: uint8
Out[36]:
id channel_sales cons_12m cons_ga
1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0
In [37]: powerco_data_merge.corr()
Out[37]:
cons_12m cons_gas_12m cons_last_month forecast_base_bill_ele forec
29 rows × 29 columns
In [38]: # Based on the correlation figures, a strong correlation occurs between foreca
st_base_bill_ele and forecast_base_bill_year
In [39]: # Based on the correlation figures, a strong correlation occurs between foreca
st_base_bill_ele and forecast_base_bill_year
sns.countplot(x="has_gas",data=powerco_data_merge)
sns.countplot(x="churn",data=powerco_data_merge)
In [ ]: