0% found this document useful (0 votes)
76 views26 pages

BCG Internship Task 2

The document discusses data wrangling and exploratory data analysis. It imports customer, pricing, and churn data from CSV files, checks for duplicates, and handles duplicates by dropping them to create clean pricing data for analysis.

Uploaded by

yomoloja
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
76 views26 pages

BCG Internship Task 2

The document discusses data wrangling and exploratory data analysis. It imports customer, pricing, and churn data from CSV files, checks for duplicates, and handles duplicates by dropping them to create clean pricing data for analysis.

Uploaded by

yomoloja
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 26

05/01/2021 BCG INTERNSHIP TASK 2

Data Wrangling and Exploratory Data Analysis

Data Wrangling
In [1]: ## Importing neccesary packages into the notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing data from CSV file directory

In [2]: ## Importing the CSV from file directory into the notebook

## Importing the customer data

cust_data = pd.read_csv(r"C:\Users\yomol\Downloads\ml_case_training_data.csv")

## Importing the pricing data


price_data = pd.read_csv(r"C:\Users\yomol\Downloads\ml_case_training_hist_dat
a.csv")

## Importing the churn data


churn_data = pd.read_csv(r"C:\Users\yomol\Downloads\ml_case_training_output.cs
v")

Understanding imported data

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 1/26


05/01/2021 BCG INTERNSHIP TASK 2

In [3]: # Understanding Customer data

cust_data.head()

Out[3]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lmkeba

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN foos

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN foos

4 bba03439a292a1e166f80264c16191cb NaN NaN lmkeba

5 rows × 32 columns

In [4]: # Checking if Customer data is duplicated.

cust_data.duplicated("id")

Out[4]: 0 False
1 False
2 False
3 False
4 False
...
16091 False
16092 False
16093 False
16094 False
16095 False
Length: 16096, dtype: bool

the "id" column is not duplicated in the Customer data.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 2/26


05/01/2021 BCG INTERNSHIP TASK 2

In [5]: # Viewing and understanding pricing data

price_data.head()

Out[5]:
id price_date price_p1_var price_p2_var price_p3_var pric

2015-01-
0 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01

2015-02-
1 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01

2015-03-
2 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01

2015-04-
3 038af19179925da21a25619c5a24b745 0.149626 0.0 0.0 44
01

2015-05-
4 038af19179925da21a25619c5a24b745 0.149626 0.0 0.0 44
01

In [6]: # Checking if pricing data is duplicated

price_data.duplicated("id")

Out[6]: 0 False
1 True
2 True
3 True
4 True
...
192997 True
192998 True
192999 True
193000 True
193001 True
Length: 193002, dtype: bool

"id" is duplicated in the pricing data as seen above

Handling duplicates in pricing data

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 3/26


05/01/2021 BCG INTERNSHIP TASK 2

In [7]: # We need to drop duplicates on the pricing data.

price_data_new = price_data.drop_duplicates("id", keep = "first")

price_data_new

Out[7]:
id price_date price_p1_var price_p2_var price_p3_var

2015-01-
0 038af19179925da21a25619c5a24b745 0.151367 0.000000 0.000000
01

2015-01-
12 31f2ce549924679a3cbb2d128ae9ea43 0.125976 0.103395 0.071536
01

2015-01-
24 36b6352b4656216bfdb96f01e9a94b4e 0.123086 0.100505 0.068646
01

2015-01-
36 48f3e6e86f7a8656b2c6b6ce2763055e 0.144431 0.000000 0.000000
01

2015-01-
48 cce88c7d721430d8bd31f71ae686c91e 0.153159 0.130578 0.098720
01

... ... ... ... ... ...

2015-01-
192942 cd622263c26436d1237e94ff05cdd506 0.151367 0.000000 0.000000
01

2015-01-
192954 ed3434c3c1e2056d1a313e2671815e4d 0.128069 0.105843 0.073773
01

2015-01-
192966 d00da2c0c568614b9937791f681cd7d7 0.150211 0.000000 0.000000
01

2015-01-
192978 045f94f0b7f538a8d8fae11080abb5da 0.151367 0.000000 0.000000
01

2015-01-
192990 16f51cdc2baa19af0b940ee1b3dd17d5 0.129444 0.106863 0.075004
01

16096 rows × 8 columns

Duplicates have been removed from the pricing data

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 4/26


05/01/2021 BCG INTERNSHIP TASK 2

In [8]: # Checking the info of the new price data.

price_data_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 192990
Data columns (total 8 columns):
id 16096 non-null object
price_date 16096 non-null object
price_p1_var 16060 non-null float64
price_p2_var 16060 non-null float64
price_p3_var 16060 non-null float64
price_p1_fix 16060 non-null float64
price_p2_fix 16060 non-null float64
price_p3_fix 16060 non-null float64
dtypes: float64(6), object(2)
memory usage: 1.1+ MB

We can see missing values on the prices column beacuse the length of each of the columns was short
by 36. We will deal with missing values later.

Data Transformation

In [9]: # To check if the id columns in the new pricing data is same with the customer
data preparing for merging the two data.

check_data_columns = price_data_new.id.isin(cust_data.id).astype(int)

check_data_columns

Out[9]: 0 1
12 1
24 1
36 1
48 1
..
192942 1
192954 1
192966 1
192978 1
192990 1
Name: id, Length: 16096, dtype: int32

In [10]: # Checking the length of the "id" data column check.

check_data_columns.value_counts()

Out[10]: 1 16096
Name: id, dtype: int64

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 5/26


05/01/2021 BCG INTERNSHIP TASK 2

The length of the check is similar to the id length in both new pricing data and customer data. so we can
merge them.

In [11]: # exploring the churn data


churn_data.head()

Out[11]:
id churn

0 48ada52261e7cf58715202705a0451c9 0

1 24011ae4ebbe3035111d65fa7c15bc57 1

2 d29c2c54acc38ff3c0614d0a653813dd 0

3 764c75f661154dac3a6c254cd082ea7d 0

4 bba03439a292a1e166f80264c16191cb 0

In [12]: ## Checking if the elements of column id in churn_data dataframe are unique va


lues

churn_data.duplicated("id", keep=False)

Out[12]: 0 False
1 False
2 False
3 False
4 False
...
16091 False
16092 False
16093 False
16094 False
16095 False
Length: 16096, dtype: bool

The id column in the churn data are not duplicated.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 6/26


05/01/2021 BCG INTERNSHIP TASK 2

In [13]: # check if the id column of the churn data is same as the id column of the cus
tomer data.

check_data_columns2 = churn_data.id.isin(cust_data.id).astype(int)

check_data_columns2

Out[13]: 0 1
1 1
2 1
3 1
4 1
..
16091 1
16092 1
16093 1
16094 1
16095 1
Name: id, Length: 16096, dtype: int32

In [14]: check_data_columns2.value_counts()

Out[14]: 1 16096
Name: id, dtype: int64

Churn data is similar in length of id column to customer data. We can merge both dataframe to a single
data frame.

Merging the different dataframe

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 7/26


05/01/2021 BCG INTERNSHIP TASK 2

In [15]: ## Merging the churn data and the customer data for data wrangling.

powerco_data_initial = cust_data.merge(churn_data, on="id", how= "left")


powerco_data_initial

Out[15]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lm

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN

4 bba03439a292a1e166f80264c16191cb NaN NaN lm

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 NaN NaN

16092 d0a6f71671571ed83b2645d23af6de00 NaN NaN

16093 10e6828ddd62cbcf687cb74928c4c2d2 NaN NaN

16094 1cf20fd6206d7678d5bcafd28c53b4db NaN NaN

16095 563dde550fd624d7352f3de77c0cdfcd NaN NaN

16096 rows × 33 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 8/26


05/01/2021 BCG INTERNSHIP TASK 2

In [16]: # Merging the pricing data with the above powerco_data_initial

powerco_data_merge = powerco_data_initial.merge(price_data_new, on="id", how =


"left")
powerco_data_merge

Out[16]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lm

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN

4 bba03439a292a1e166f80264c16191cb NaN NaN lm

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 NaN NaN

16092 d0a6f71671571ed83b2645d23af6de00 NaN NaN

16093 10e6828ddd62cbcf687cb74928c4c2d2 NaN NaN

16094 1cf20fd6206d7678d5bcafd28c53b4db NaN NaN

16095 563dde550fd624d7352f3de77c0cdfcd NaN NaN

16096 rows × 40 columns

Data Cleaning (Wrangling)

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 9/26


05/01/2021 BCG INTERNSHIP TASK 2

In [17]: ## Understanding the powerco_data_final data

powerco_data_merge.head()

Out[17]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lmkeba

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN foos

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN foos

4 bba03439a292a1e166f80264c16191cb NaN NaN lmkeba

5 rows × 40 columns

In [18]: ## Understanding the descriptive statistics of the data

powerco_data_merge.describe(include="all")

Out[18]:
id activity_new campaign_disc_ele

count 16096 6551 0.0

unique 16096 419 NaN

top c781055508feb0728c63f20eedd2d352 apdekpcbwosbxepsfxclislboipuxpop NaN fo

freq 1 1577 NaN

mean NaN NaN NaN

std NaN NaN NaN

min NaN NaN NaN

25% NaN NaN NaN

50% NaN NaN NaN

75% NaN NaN NaN

max NaN NaN NaN

11 rows × 40 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 10/26


05/01/2021 BCG INTERNSHIP TASK 2

In [19]: ## Understanding the collumn data types and missing values in the collumn

powerco_data_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 40 columns):
id 16096 non-null object
activity_new 6551 non-null object
campaign_disc_ele 0 non-null float64
channel_sales 11878 non-null object
cons_12m 16096 non-null int64
cons_gas_12m 16096 non-null int64
cons_last_month 16096 non-null int64
date_activ 16096 non-null object
date_end 16094 non-null object
date_first_activ 3508 non-null object
date_modif_prod 15939 non-null object
date_renewal 16056 non-null object
forecast_base_bill_ele 3508 non-null float64
forecast_base_bill_year 3508 non-null float64
forecast_bill_12m 3508 non-null float64
forecast_cons 3508 non-null float64
forecast_cons_12m 16096 non-null float64
forecast_cons_year 16096 non-null int64
forecast_discount_energy 15970 non-null float64
forecast_meter_rent_12m 16096 non-null float64
forecast_price_energy_p1 15970 non-null float64
forecast_price_energy_p2 15970 non-null float64
forecast_price_pow_p1 15970 non-null float64
has_gas 16096 non-null object
imp_cons 16096 non-null float64
margin_gross_pow_ele 16083 non-null float64
margin_net_pow_ele 16083 non-null float64
nb_prod_act 16096 non-null int64
net_margin 16081 non-null float64
num_years_antig 16096 non-null int64
origin_up 16009 non-null object
pow_max 16093 non-null float64
churn 16096 non-null int64
price_date 16096 non-null object
price_p1_var 16060 non-null float64
price_p2_var 16060 non-null float64
price_p3_var 16060 non-null float64
price_p1_fix 16060 non-null float64
price_p2_fix 16060 non-null float64
price_p3_fix 16060 non-null float64
dtypes: float64(22), int64(7), object(11)
memory usage: 5.0+ MB

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 11/26


05/01/2021 BCG INTERNSHIP TASK 2

Columns like:
activity_new,channel_sales,date_first_activ,date_modif_prod,forecast_base_bill_ele,forecast_base_bill_year,foreca
have very noticeable missing values. We have to investigate the reason for missing values and if we need to let
the column.</span>

In [20]: # Analysing collumns with missing values one by one.

check_activity_new = powerco_data_merge[["activity_new"]]
check_activity_new

Out[20]:
activity_new

0 esoiiifxdlbkcsluxmfuacbdckommixw

1 NaN

2 NaN

3 NaN

4 NaN

... ...

16091 NaN

16092 NaN

16093 NaN

16094 NaN

16095 NaN

16096 rows × 1 columns

In [21]: check_activity_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 1 columns):
activity_new 6551 non-null object
dtypes: object(1)
memory usage: 251.5+ KB

activity_new is the category of the companys activity. Since there data in the column is encrypted, and
there is no way to know the meaning of the data and so we cannot perform a transformation and operate
on the column. We consider deleting the column. Same with the campaign_disc_ele whose elements are
NaN

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 12/26


05/01/2021 BCG INTERNSHIP TASK 2

In [22]: check_campaign_disc_ele = powerco_data_merge[["campaign_disc_ele"]]


check_campaign_disc_ele

Out[22]:
campaign_disc_ele

0 NaN

1 NaN

2 NaN

3 NaN

4 NaN

... ...

16091 NaN

16092 NaN

16093 NaN

16094 NaN

16095 NaN

16096 rows × 1 columns

In [23]: check_campaign_disc_ele.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 1 columns):
campaign_disc_ele 0 non-null float64
dtypes: float64(1)
memory usage: 251.5 KB

In [24]: # Deleting the activity_new and campaign_disc_ele column

del powerco_data_merge["activity_new"]

In [25]: del powerco_data_merge["campaign_disc_ele"]

Checking the data to see the exit of the two irrelevant columns due to excessive missing values.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 13/26


05/01/2021 BCG INTERNSHIP TASK 2

In [26]: powerco_data_merge

Out[26]:
id channel_sales cons_12m cons_ga

0 48ada52261e7cf58715202705a0451c9 lmkebamcaaclubfxadlmueccxoimlema 309275

1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0

2 d29c2c54acc38ff3c0614d0a653813dd NaN 4660

3 764c75f661154dac3a6c254cd082ea7d foosdfpfkusacimwkcsosbicdxkicaua 544

4 bba03439a292a1e166f80264c16191cb lmkebamcaaclubfxadlmueccxoimlema 1584

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 foosdfpfkusacimwkcsosbicdxkicaua 32270

16092 d0a6f71671571ed83b2645d23af6de00 foosdfpfkusacimwkcsosbicdxkicaua 7223

16093 10e6828ddd62cbcf687cb74928c4c2d2 foosdfpfkusacimwkcsosbicdxkicaua 1844

16094 1cf20fd6206d7678d5bcafd28c53b4db foosdfpfkusacimwkcsosbicdxkicaua 131

16095 563dde550fd624d7352f3de77c0cdfcd NaN 8730

16096 rows × 38 columns

In [27]: # Column Channel sales( code of the channel sales) also have missing values

channel_sales=powerco_data_merge[["channel_sales"]]

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 14/26


05/01/2021 BCG INTERNSHIP TASK 2

In [28]: channel_sales.isnull()

Out[28]:
channel_sales

0 False

1 False

2 True

3 False

4 False

... ...

16091 False

16092 False

16093 False

16094 False

16095 True

16096 rows × 1 columns

we have 4218 number of missing values of the channel_sales . We will attempt to replace the missing
values by forward fill.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 15/26


05/01/2021 BCG INTERNSHIP TASK 2

In [29]: # We use the fillna method to fill forward without limit so we can have all pl
aces in the column fixed.

channel_sales1 = powerco_data_merge[["channel_sales"]].fillna(method="pad")

channel_sales1

Out[29]:
channel_sales

0 lmkebamcaaclubfxadlmueccxoimlema

1 foosdfpfkusacimwkcsosbicdxkicaua

2 foosdfpfkusacimwkcsosbicdxkicaua

3 foosdfpfkusacimwkcsosbicdxkicaua

4 lmkebamcaaclubfxadlmueccxoimlema

... ...

16091 foosdfpfkusacimwkcsosbicdxkicaua

16092 foosdfpfkusacimwkcsosbicdxkicaua

16093 foosdfpfkusacimwkcsosbicdxkicaua

16094 foosdfpfkusacimwkcsosbicdxkicaua

16095 foosdfpfkusacimwkcsosbicdxkicaua

16096 rows × 1 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 16/26


05/01/2021 BCG INTERNSHIP TASK 2

In [30]: powerco_data_merge["channel_sales"]=channel_sales1
powerco_data_merge

Out[30]:
id channel_sales cons_12m cons_ga

0 48ada52261e7cf58715202705a0451c9 lmkebamcaaclubfxadlmueccxoimlema 309275

1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0

2 d29c2c54acc38ff3c0614d0a653813dd foosdfpfkusacimwkcsosbicdxkicaua 4660

3 764c75f661154dac3a6c254cd082ea7d foosdfpfkusacimwkcsosbicdxkicaua 544

4 bba03439a292a1e166f80264c16191cb lmkebamcaaclubfxadlmueccxoimlema 1584

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 foosdfpfkusacimwkcsosbicdxkicaua 32270

16092 d0a6f71671571ed83b2645d23af6de00 foosdfpfkusacimwkcsosbicdxkicaua 7223

16093 10e6828ddd62cbcf687cb74928c4c2d2 foosdfpfkusacimwkcsosbicdxkicaua 1844

16094 1cf20fd6206d7678d5bcafd28c53b4db foosdfpfkusacimwkcsosbicdxkicaua 131

16095 563dde550fd624d7352f3de77c0cdfcd foosdfpfkusacimwkcsosbicdxkicaua 8730

16096 rows × 38 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 17/26


05/01/2021 BCG INTERNSHIP TASK 2

In [31]: # Channell sales has no more missing values.


powerco_data_merge.isnull().sum()

Out[31]: id 0
channel_sales 0
cons_12m 0
cons_gas_12m 0
cons_last_month 0
date_activ 0
date_end 2
date_first_activ 12588
date_modif_prod 157
date_renewal 40
forecast_base_bill_ele 12588
forecast_base_bill_year 12588
forecast_bill_12m 12588
forecast_cons 12588
forecast_cons_12m 0
forecast_cons_year 0
forecast_discount_energy 126
forecast_meter_rent_12m 0
forecast_price_energy_p1 126
forecast_price_energy_p2 126
forecast_price_pow_p1 126
has_gas 0
imp_cons 0
margin_gross_pow_ele 13
margin_net_pow_ele 13
nb_prod_act 0
net_margin 15
num_years_antig 0
origin_up 87
pow_max 3
churn 0
price_date 0
price_p1_var 36
price_p2_var 36
price_p3_var 36
price_p1_fix 36
price_p2_fix 36
price_p3_fix 36
dtype: int64

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 18/26


05/01/2021 BCG INTERNSHIP TASK 2

In [32]: # lets replace the all the numeric series with their mean

mean_price_p1_var = powerco_data_merge["price_p1_var"].sum()/16096

powerco_data_merge["price_p1_var"]=powerco_data_merge["price_p1_var"].fillna(m
ean_price_p1_var)

mean_price_p2_var = powerco_data_merge["price_p2_var"].sum()/16096

powerco_data_merge["price_p2_var"]=powerco_data_merge["price_p2_var"].fillna(m
ean_price_p2_var)

mean_price_p3_var = powerco_data_merge["price_p3_var"].sum()/16096

powerco_data_merge["price_p3_var"]=powerco_data_merge["price_p3_var"].fillna(m
ean_price_p3_var)

# p1_fix

mean_price_p1_fix = powerco_data_merge["price_p1_fix"].sum()/16096

powerco_data_merge["price_p1_fix"]=powerco_data_merge["price_p1_fix"].fillna(m
ean_price_p1_fix )

# p2_fix

mean_price_p2_fix = powerco_data_merge["price_p2_fix"].sum()/16096

powerco_data_merge["price_p2_fix"]=powerco_data_merge["price_p2_fix"].fillna(m
ean_price_p2_fix )

mean_price_p3_fix = powerco_data_merge["price_p3_fix"].sum()/16096

powerco_data_merge["price_p3_fix"]=powerco_data_merge["price_p3_fix"].fillna(m
ean_price_p3_fix)

# price_p3_fix

mean_price_p3_fix = powerco_data_merge["price_p3_fix"].sum()/16096

powerco_data_merge["price_p3_fix"]=powerco_data_merge["price_p3_fix"].fillna(m
ean_price_p3_fix)

# margin_gross_pow_ele

mean_margin_gross_pow_ele = powerco_data_merge["margin_gross_pow_ele"].sum()/
16096

powerco_data_merge["margin_gross_pow_ele"]=powerco_data_merge["margin_gross_po
w_ele"].fillna(mean_margin_gross_pow_ele)

# margin_net_pow_ele

mean_margin_net_pow_ele = powerco_data_merge["margin_net_pow_ele"].sum()/160

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 19/26


05/01/2021 BCG INTERNSHIP TASK 2

96

powerco_data_merge["margin_net_pow_ele"]=powerco_data_merge["margin_net_pow_el
e"].fillna(mean_margin_net_pow_ele)

# net_margin

mean_net_margin = powerco_data_merge["net_margin"].sum()/16096

powerco_data_merge["net_margin"]=powerco_data_merge["net_margin"].fillna(mean_
net_margin)

# pow_max

mean_pow_max = powerco_data_merge["pow_max"].sum()/16096

powerco_data_merge["pow_max"]=powerco_data_merge["pow_max"].fillna(mean_pow_ma
x)

# origin_up

powerco_data_merge["origin_up"]= powerco_data_merge["origin_up"].fillna(method
="pad")

# forecast_base_bill_ele

powerco_data_merge["forecast_base_bill_ele"]= powerco_data_merge["forecast_bas
e_bill_ele"].fillna(value=0)

# forecast_base_bill_year

powerco_data_merge["forecast_base_bill_year"]= powerco_data_merge["forecast_ba
se_bill_year"].fillna(value=0)

# forecast_bill_12m
powerco_data_merge["forecast_bill_12m"]= powerco_data_merge["forecast_bill_12
m"].fillna(value=0)

#forecast_cons
powerco_data_merge["forecast_cons"]= powerco_data_merge["forecast_cons"].filln
a(value=0)

# forecast_discount_energy

mean_forecast_discount_energy = powerco_data_merge["forecast_discount_energ
y"].sum()/16096

powerco_data_merge["forecast_discount_energy"]= powerco_data_merge["forecast_d
iscount_energy"].fillna(mean_forecast_discount_energy )

# forecast_price_energy_p1

mean_forecast_price_energy_p1 = powerco_data_merge["forecast_price_energy_p
1"].sum()/16096

powerco_data_merge["forecast_price_energy_p1"]= powerco_data_merge["forecast_p
rice_energy_p1"].fillna(mean_forecast_price_energy_p1)

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 20/26


05/01/2021 BCG INTERNSHIP TASK 2

# forecast_price_energy_p2

mean_forecast_price_energy_p2 = powerco_data_merge["forecast_price_energy_
p2"].sum()/16096

powerco_data_merge["forecast_price_energy_p2"]= powerco_data_merge["forecast_p
rice_energy_p2"].fillna(mean_forecast_price_energy_p2)

# forecast_price_pow_p1

mean_forecast_price_pow_p1 = powerco_data_merge["forecast_price_pow_p1"].
sum()/16096

powerco_data_merge["forecast_price_pow_p1"]= powerco_data_merge["forecast_pric
e_pow_p1"].fillna(mean_forecast_price_energy_p2)

powerco_data_merge.isnull().sum()

powerco_data_merge.info()

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 21/26


05/01/2021 BCG INTERNSHIP TASK 2

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 38 columns):
id 16096 non-null object
channel_sales 16096 non-null object
cons_12m 16096 non-null int64
cons_gas_12m 16096 non-null int64
cons_last_month 16096 non-null int64
date_activ 16096 non-null object
date_end 16094 non-null object
date_first_activ 3508 non-null object
date_modif_prod 15939 non-null object
date_renewal 16056 non-null object
forecast_base_bill_ele 16096 non-null float64
forecast_base_bill_year 16096 non-null float64
forecast_bill_12m 16096 non-null float64
forecast_cons 16096 non-null float64
forecast_cons_12m 16096 non-null float64
forecast_cons_year 16096 non-null int64
forecast_discount_energy 16096 non-null float64
forecast_meter_rent_12m 16096 non-null float64
forecast_price_energy_p1 16096 non-null float64
forecast_price_energy_p2 16096 non-null float64
forecast_price_pow_p1 16096 non-null float64
has_gas 16096 non-null object
imp_cons 16096 non-null float64
margin_gross_pow_ele 16096 non-null float64
margin_net_pow_ele 16096 non-null float64
nb_prod_act 16096 non-null int64
net_margin 16096 non-null float64
num_years_antig 16096 non-null int64
origin_up 16096 non-null object
pow_max 16096 non-null float64
churn 16096 non-null int64
price_date 16096 non-null object
price_p1_var 16096 non-null float64
price_p2_var 16096 non-null float64
price_p3_var 16096 non-null float64
price_p1_fix 16096 non-null float64
price_p2_fix 16096 non-null float64
price_p3_fix 16096 non-null float64
dtypes: float64(21), int64(7), object(10)
memory usage: 4.8+ MB

Dealing with Categorical Variable.

In [33]: # Column has_gas is a Categorical variable that needs to be converted to integ


er from string.

powerco_data_merge["has_gas"].value_counts()

Out[33]: f 13132
t 2964
Name: has_gas, dtype: int64

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 22/26


05/01/2021 BCG INTERNSHIP TASK 2

In [34]: powerco_data_merge["has_gas"] = pd.get_dummies(powerco_data_merge["has_gas"])

In [35]: powerco_data_merge["has_gas"]

Out[35]: 0 1
1 0
2 1
3 1
4 1
..
16091 0
16092 1
16093 1
16094 1
16095 1
Name: has_gas, Length: 16096, dtype: uint8

Exploratory Data Analysis


In [36]: powerco_data_merge

Out[36]:
id channel_sales cons_12m cons_ga

0 48ada52261e7cf58715202705a0451c9 lmkebamcaaclubfxadlmueccxoimlema 309275

1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0

2 d29c2c54acc38ff3c0614d0a653813dd foosdfpfkusacimwkcsosbicdxkicaua 4660

3 764c75f661154dac3a6c254cd082ea7d foosdfpfkusacimwkcsosbicdxkicaua 544

4 bba03439a292a1e166f80264c16191cb lmkebamcaaclubfxadlmueccxoimlema 1584

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 foosdfpfkusacimwkcsosbicdxkicaua 32270

16092 d0a6f71671571ed83b2645d23af6de00 foosdfpfkusacimwkcsosbicdxkicaua 7223

16093 10e6828ddd62cbcf687cb74928c4c2d2 foosdfpfkusacimwkcsosbicdxkicaua 1844

16094 1cf20fd6206d7678d5bcafd28c53b4db foosdfpfkusacimwkcsosbicdxkicaua 131

16095 563dde550fd624d7352f3de77c0cdfcd foosdfpfkusacimwkcsosbicdxkicaua 8730

16096 rows × 38 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 23/26


05/01/2021 BCG INTERNSHIP TASK 2

In [37]: powerco_data_merge.corr()

Out[37]:
cons_12m cons_gas_12m cons_last_month forecast_base_bill_ele forec

cons_12m 1.000000 0.471233 0.919545 0.073093

cons_gas_12m 0.471233 1.000000 0.447209 0.081878

cons_last_month 0.919545 0.447209 1.000000 0.067431

forecast_base_bill_ele 0.073093 0.081878 0.067431 1.000000

forecast_base_bill_year 0.073093 0.081878 0.067431 1.000000

forecast_bill_12m 0.078228 0.084076 0.064444 0.833342

forecast_cons 0.073653 0.074400 0.068190 0.968847

forecast_cons_12m 0.165168 0.059525 0.129574 0.318761

forecast_cons_year 0.139526 0.057619 0.151476 0.363838

forecast_discount_energy -0.043551 -0.014410 -0.037699 0.011114

forecast_meter_rent_12m 0.085996 0.040327 0.076066 0.214090

forecast_price_energy_p1 -0.033440 -0.021711 -0.024193 -0.127930

forecast_price_energy_p2 0.146225 0.075606 0.122923 0.142913

forecast_price_pow_p1 -0.022942 -0.038159 -0.015738 0.029144

has_gas -0.229761 -0.372771 -0.202702 -0.042248

imp_cons 0.139353 0.060609 0.153861 0.382069

margin_gross_pow_ele -0.065186 -0.016866 -0.054069 -0.032531

margin_net_pow_ele -0.045560 -0.008242 -0.037665 -0.021195

nb_prod_act 0.308567 0.272005 0.350711 0.000761

net_margin 0.119908 0.058928 0.096343 0.301896

num_years_antig 0.008810 -0.008626 0.004860 -0.071432

pow_max 0.102422 0.052365 0.089565 0.274834

churn -0.051759 -0.040880 -0.046931 0.038639

price_p1_var -0.016911 -0.012686 -0.009622 -0.119629

price_p2_var 0.141069 0.077770 0.118864 0.137585

price_p3_var 0.059166 0.047497 0.046869 0.162319

price_p1_fix -0.010066 -0.019750 -0.008352 0.050983

price_p2_fix 0.063390 0.042713 0.049753 0.184279

price_p3_fix 0.069990 0.054606 0.055159 0.146502

29 rows × 29 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 24/26


05/01/2021 BCG INTERNSHIP TASK 2

In [38]: # Based on the correlation figures, a strong correlation occurs between foreca
st_base_bill_ele and forecast_base_bill_year

sns.scatterplot(x="forecast_base_bill_ele", y="forecast_base_bill_year", data=


powerco_data_merge)

Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x25a4830c6c8>

In [39]: # Based on the correlation figures, a strong correlation occurs between foreca
st_base_bill_ele and forecast_base_bill_year

sns.scatterplot(x="forecast_cons", y="forecast_base_bill_year", data=powerco_d


ata_merge)

Out[39]: <matplotlib.axes._subplots.AxesSubplot at 0x25a487077c8>

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 25/26


05/01/2021 BCG INTERNSHIP TASK 2

In [40]: # Number of clients who also have gas

sns.countplot(x="has_gas",data=powerco_data_merge)

Out[40]: <matplotlib.axes._subplots.AxesSubplot at 0x25a488ea5c8>

In [41]: # Number of customers who churn

sns.countplot(x="churn",data=powerco_data_merge)

Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x25a4892d108>

In [ ]:

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 26/26

You might also like