0% found this document useful (0 votes)

76 views26 pages

BCG Internship Task 2

The document discusses data wrangling and exploratory data analysis. It imports customer, pricing, and churn data from CSV files, checks for duplicates, and handles duplicates by dropping them to create clean pricing data for analysis.

Uploaded by

yomoloja

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

76 views26 pages

BCG Internship Task 2

Uploaded by

yomoloja

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

05/01/2021 BCG INTERNSHIP TASK 2

Data Wrangling and Exploratory Data Analysis

Data Wrangling
In [1]: ## Importing neccesary packages into the notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing data from CSV file directory

In [2]: ## Importing the CSV from file directory into the notebook

## Importing the customer data

cust_data = pd.read_csv(r"C:\Users\yomol\Downloads\ml_case_training_data.csv")

## Importing the pricing data

price_data = pd.read_csv(r"C:\Users\yomol\Downloads\ml_case_training_hist_dat
a.csv")

## Importing the churn data

churn_data = pd.read_csv(r"C:\Users\yomol\Downloads\ml_case_training_output.cs
v")

Understanding imported data

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 1/26

05/01/2021 BCG INTERNSHIP TASK 2

In [3]: # Understanding Customer data

cust_data.head()

Out[3]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lmkeba

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN foos

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN foos

4 bba03439a292a1e166f80264c16191cb NaN NaN lmkeba

5 rows × 32 columns

In [4]: # Checking if Customer data is duplicated.

cust_data.duplicated("id")

Out[4]: 0 False
1 False
2 False
3 False
4 False
...
16091 False
16092 False
16093 False
16094 False
16095 False
Length: 16096, dtype: bool

the "id" column is not duplicated in the Customer data.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 2/26

05/01/2021 BCG INTERNSHIP TASK 2

In [5]: # Viewing and understanding pricing data

price_data.head()

Out[5]:
id price_date price_p1_var price_p2_var price_p3_var pric

2015-01-
0 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01

2015-02-
1 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01

2015-03-
2 038af19179925da21a25619c5a24b745 0.151367 0.0 0.0 44
01

2015-04-
3 038af19179925da21a25619c5a24b745 0.149626 0.0 0.0 44
01

2015-05-
4 038af19179925da21a25619c5a24b745 0.149626 0.0 0.0 44
01

In [6]: # Checking if pricing data is duplicated

price_data.duplicated("id")

Out[6]: 0 False
1 True
2 True
3 True
4 True
...
192997 True
192998 True
192999 True
193000 True
193001 True
Length: 193002, dtype: bool

"id" is duplicated in the pricing data as seen above

Handling duplicates in pricing data

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 3/26

05/01/2021 BCG INTERNSHIP TASK 2

In [7]: # We need to drop duplicates on the pricing data.

price_data_new = price_data.drop_duplicates("id", keep = "first")

price_data_new

Out[7]:
id price_date price_p1_var price_p2_var price_p3_var

2015-01-
0 038af19179925da21a25619c5a24b745 0.151367 0.000000 0.000000
01

2015-01-
12 31f2ce549924679a3cbb2d128ae9ea43 0.125976 0.103395 0.071536
01

2015-01-
24 36b6352b4656216bfdb96f01e9a94b4e 0.123086 0.100505 0.068646
01

2015-01-
36 48f3e6e86f7a8656b2c6b6ce2763055e 0.144431 0.000000 0.000000
01

2015-01-
48 cce88c7d721430d8bd31f71ae686c91e 0.153159 0.130578 0.098720
01

... ... ... ... ... ...

2015-01-
192942 cd622263c26436d1237e94ff05cdd506 0.151367 0.000000 0.000000
01

2015-01-
192954 ed3434c3c1e2056d1a313e2671815e4d 0.128069 0.105843 0.073773
01

2015-01-
192966 d00da2c0c568614b9937791f681cd7d7 0.150211 0.000000 0.000000
01

2015-01-
192978 045f94f0b7f538a8d8fae11080abb5da 0.151367 0.000000 0.000000
01

2015-01-
192990 16f51cdc2baa19af0b940ee1b3dd17d5 0.129444 0.106863 0.075004
01

16096 rows × 8 columns

Duplicates have been removed from the pricing data

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 4/26

05/01/2021 BCG INTERNSHIP TASK 2

In [8]: # Checking the info of the new price data.

price_data_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 192990
Data columns (total 8 columns):
id 16096 non-null object
price_date 16096 non-null object
price_p1_var 16060 non-null float64
price_p2_var 16060 non-null float64
price_p3_var 16060 non-null float64
price_p1_fix 16060 non-null float64
price_p2_fix 16060 non-null float64
price_p3_fix 16060 non-null float64
dtypes: float64(6), object(2)
memory usage: 1.1+ MB

We can see missing values on the prices column beacuse the length of each of the columns was short
by 36. We will deal with missing values later.

Data Transformation

In [9]: # To check if the id columns in the new pricing data is same with the customer
data preparing for merging the two data.

check_data_columns = price_data_new.id.isin(cust_data.id).astype(int)

check_data_columns

Out[9]: 0 1
12 1
24 1
36 1
48 1
..
192942 1
192954 1
192966 1
192978 1
192990 1
Name: id, Length: 16096, dtype: int32

In [10]: # Checking the length of the "id" data column check.

check_data_columns.value_counts()

Out[10]: 1 16096
Name: id, dtype: int64

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 5/26

05/01/2021 BCG INTERNSHIP TASK 2

The length of the check is similar to the id length in both new pricing data and customer data. so we can
merge them.

In [11]: # exploring the churn data

churn_data.head()

Out[11]:
id churn

0 48ada52261e7cf58715202705a0451c9 0

1 24011ae4ebbe3035111d65fa7c15bc57 1

2 d29c2c54acc38ff3c0614d0a653813dd 0

3 764c75f661154dac3a6c254cd082ea7d 0

4 bba03439a292a1e166f80264c16191cb 0

In [12]: ## Checking if the elements of column id in churn_data dataframe are unique va

lues

churn_data.duplicated("id", keep=False)

Out[12]: 0 False
1 False
2 False
3 False
4 False
...
16091 False
16092 False
16093 False
16094 False
16095 False
Length: 16096, dtype: bool

The id column in the churn data are not duplicated.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 6/26

05/01/2021 BCG INTERNSHIP TASK 2

In [13]: # check if the id column of the churn data is same as the id column of the cus
tomer data.

check_data_columns2 = churn_data.id.isin(cust_data.id).astype(int)

check_data_columns2

Out[13]: 0 1
1 1
2 1
3 1
4 1
..
16091 1
16092 1
16093 1
16094 1
16095 1
Name: id, Length: 16096, dtype: int32

In [14]: check_data_columns2.value_counts()

Out[14]: 1 16096
Name: id, dtype: int64

Churn data is similar in length of id column to customer data. We can merge both dataframe to a single
data frame.

Merging the different dataframe

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 7/26

05/01/2021 BCG INTERNSHIP TASK 2

In [15]: ## Merging the churn data and the customer data for data wrangling.

powerco_data_initial = cust_data.merge(churn_data, on="id", how= "left")

powerco_data_initial

Out[15]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lm

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN

4 bba03439a292a1e166f80264c16191cb NaN NaN lm

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 NaN NaN

16092 d0a6f71671571ed83b2645d23af6de00 NaN NaN

16093 10e6828ddd62cbcf687cb74928c4c2d2 NaN NaN

16094 1cf20fd6206d7678d5bcafd28c53b4db NaN NaN

16095 563dde550fd624d7352f3de77c0cdfcd NaN NaN

16096 rows × 33 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 8/26

05/01/2021 BCG INTERNSHIP TASK 2

In [16]: # Merging the pricing data with the above powerco_data_initial

powerco_data_merge = powerco_data_initial.merge(price_data_new, on="id", how =

"left")
powerco_data_merge

Out[16]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lm

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN

4 bba03439a292a1e166f80264c16191cb NaN NaN lm

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 NaN NaN

16092 d0a6f71671571ed83b2645d23af6de00 NaN NaN

16093 10e6828ddd62cbcf687cb74928c4c2d2 NaN NaN

16094 1cf20fd6206d7678d5bcafd28c53b4db NaN NaN

16095 563dde550fd624d7352f3de77c0cdfcd NaN NaN

16096 rows × 40 columns

Data Cleaning (Wrangling)

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 9/26

05/01/2021 BCG INTERNSHIP TASK 2

In [17]: ## Understanding the powerco_data_final data

powerco_data_merge.head()

Out[17]:
id activity_new campaign_disc_ele

0 48ada52261e7cf58715202705a0451c9 esoiiifxdlbkcsluxmfuacbdckommixw NaN lmkeba

1 24011ae4ebbe3035111d65fa7c15bc57 NaN NaN foos

2 d29c2c54acc38ff3c0614d0a653813dd NaN NaN

3 764c75f661154dac3a6c254cd082ea7d NaN NaN foos

4 bba03439a292a1e166f80264c16191cb NaN NaN lmkeba

5 rows × 40 columns

In [18]: ## Understanding the descriptive statistics of the data

powerco_data_merge.describe(include="all")

Out[18]:
id activity_new campaign_disc_ele

count 16096 6551 0.0

unique 16096 419 NaN

top c781055508feb0728c63f20eedd2d352 apdekpcbwosbxepsfxclislboipuxpop NaN fo

freq 1 1577 NaN

mean NaN NaN NaN

std NaN NaN NaN

min NaN NaN NaN

25% NaN NaN NaN

50% NaN NaN NaN

75% NaN NaN NaN

max NaN NaN NaN

11 rows × 40 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 10/26

05/01/2021 BCG INTERNSHIP TASK 2

In [19]: ## Understanding the collumn data types and missing values in the collumn

powerco_data_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 40 columns):
id 16096 non-null object
activity_new 6551 non-null object
campaign_disc_ele 0 non-null float64
channel_sales 11878 non-null object
cons_12m 16096 non-null int64
cons_gas_12m 16096 non-null int64
cons_last_month 16096 non-null int64
date_activ 16096 non-null object
date_end 16094 non-null object
date_first_activ 3508 non-null object
date_modif_prod 15939 non-null object
date_renewal 16056 non-null object
forecast_base_bill_ele 3508 non-null float64
forecast_base_bill_year 3508 non-null float64
forecast_bill_12m 3508 non-null float64
forecast_cons 3508 non-null float64
forecast_cons_12m 16096 non-null float64
forecast_cons_year 16096 non-null int64
forecast_discount_energy 15970 non-null float64
forecast_meter_rent_12m 16096 non-null float64
forecast_price_energy_p1 15970 non-null float64
forecast_price_energy_p2 15970 non-null float64
forecast_price_pow_p1 15970 non-null float64
has_gas 16096 non-null object
imp_cons 16096 non-null float64
margin_gross_pow_ele 16083 non-null float64
margin_net_pow_ele 16083 non-null float64
nb_prod_act 16096 non-null int64
net_margin 16081 non-null float64
num_years_antig 16096 non-null int64
origin_up 16009 non-null object
pow_max 16093 non-null float64
churn 16096 non-null int64
price_date 16096 non-null object
price_p1_var 16060 non-null float64
price_p2_var 16060 non-null float64
price_p3_var 16060 non-null float64
price_p1_fix 16060 non-null float64
price_p2_fix 16060 non-null float64
price_p3_fix 16060 non-null float64
dtypes: float64(22), int64(7), object(11)
memory usage: 5.0+ MB

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 11/26

05/01/2021 BCG INTERNSHIP TASK 2

Columns like:
activity_new,channel_sales,date_first_activ,date_modif_prod,forecast_base_bill_ele,forecast_base_bill_year,foreca
have very noticeable missing values. We have to investigate the reason for missing values and if we need to let
the column.</span>

In [20]: # Analysing collumns with missing values one by one.

check_activity_new = powerco_data_merge[["activity_new"]]
check_activity_new

Out[20]:
activity_new

0 esoiiifxdlbkcsluxmfuacbdckommixw

1 NaN

2 NaN

3 NaN

4 NaN

... ...

16091 NaN

16092 NaN

16093 NaN

16094 NaN

16095 NaN

16096 rows × 1 columns

In [21]: check_activity_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 1 columns):
activity_new 6551 non-null object
dtypes: object(1)
memory usage: 251.5+ KB

activity_new is the category of the companys activity. Since there data in the column is encrypted, and
there is no way to know the meaning of the data and so we cannot perform a transformation and operate
on the column. We consider deleting the column. Same with the campaign_disc_ele whose elements are
NaN

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 12/26

05/01/2021 BCG INTERNSHIP TASK 2

In [22]: check_campaign_disc_ele = powerco_data_merge[["campaign_disc_ele"]]

check_campaign_disc_ele

Out[22]:
campaign_disc_ele

0 NaN

1 NaN

2 NaN

3 NaN

4 NaN

... ...

16091 NaN

16092 NaN

16093 NaN

16094 NaN

16095 NaN

16096 rows × 1 columns

In [23]: check_campaign_disc_ele.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 1 columns):
campaign_disc_ele 0 non-null float64
dtypes: float64(1)
memory usage: 251.5 KB

In [24]: # Deleting the activity_new and campaign_disc_ele column

del powerco_data_merge["activity_new"]

In [25]: del powerco_data_merge["campaign_disc_ele"]

Checking the data to see the exit of the two irrelevant columns due to excessive missing values.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 13/26

05/01/2021 BCG INTERNSHIP TASK 2

In [26]: powerco_data_merge

Out[26]:
id channel_sales cons_12m cons_ga

0 48ada52261e7cf58715202705a0451c9 lmkebamcaaclubfxadlmueccxoimlema 309275

1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0

2 d29c2c54acc38ff3c0614d0a653813dd NaN 4660

3 764c75f661154dac3a6c254cd082ea7d foosdfpfkusacimwkcsosbicdxkicaua 544

4 bba03439a292a1e166f80264c16191cb lmkebamcaaclubfxadlmueccxoimlema 1584

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 foosdfpfkusacimwkcsosbicdxkicaua 32270

16092 d0a6f71671571ed83b2645d23af6de00 foosdfpfkusacimwkcsosbicdxkicaua 7223

16093 10e6828ddd62cbcf687cb74928c4c2d2 foosdfpfkusacimwkcsosbicdxkicaua 1844

16094 1cf20fd6206d7678d5bcafd28c53b4db foosdfpfkusacimwkcsosbicdxkicaua 131

16095 563dde550fd624d7352f3de77c0cdfcd NaN 8730

16096 rows × 38 columns

In [27]: # Column Channel sales( code of the channel sales) also have missing values

channel_sales=powerco_data_merge[["channel_sales"]]

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 14/26

05/01/2021 BCG INTERNSHIP TASK 2

In [28]: channel_sales.isnull()

Out[28]:
channel_sales

0 False

1 False

2 True

3 False

4 False

... ...

16091 False

16092 False

16093 False

16094 False

16095 True

16096 rows × 1 columns

we have 4218 number of missing values of the channel_sales . We will attempt to replace the missing
values by forward fill.

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 15/26

05/01/2021 BCG INTERNSHIP TASK 2

In [29]: # We use the fillna method to fill forward without limit so we can have all pl
aces in the column fixed.

channel_sales1 = powerco_data_merge[["channel_sales"]].fillna(method="pad")

channel_sales1

Out[29]:
channel_sales

0 lmkebamcaaclubfxadlmueccxoimlema

1 foosdfpfkusacimwkcsosbicdxkicaua

2 foosdfpfkusacimwkcsosbicdxkicaua

3 foosdfpfkusacimwkcsosbicdxkicaua

4 lmkebamcaaclubfxadlmueccxoimlema

... ...

16091 foosdfpfkusacimwkcsosbicdxkicaua

16092 foosdfpfkusacimwkcsosbicdxkicaua

16093 foosdfpfkusacimwkcsosbicdxkicaua

16094 foosdfpfkusacimwkcsosbicdxkicaua

16095 foosdfpfkusacimwkcsosbicdxkicaua

16096 rows × 1 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 16/26

05/01/2021 BCG INTERNSHIP TASK 2

In [30]: powerco_data_merge["channel_sales"]=channel_sales1
powerco_data_merge

Out[30]:
id channel_sales cons_12m cons_ga

0 48ada52261e7cf58715202705a0451c9 lmkebamcaaclubfxadlmueccxoimlema 309275

1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0

2 d29c2c54acc38ff3c0614d0a653813dd foosdfpfkusacimwkcsosbicdxkicaua 4660

3 764c75f661154dac3a6c254cd082ea7d foosdfpfkusacimwkcsosbicdxkicaua 544

4 bba03439a292a1e166f80264c16191cb lmkebamcaaclubfxadlmueccxoimlema 1584

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 foosdfpfkusacimwkcsosbicdxkicaua 32270

16092 d0a6f71671571ed83b2645d23af6de00 foosdfpfkusacimwkcsosbicdxkicaua 7223

16093 10e6828ddd62cbcf687cb74928c4c2d2 foosdfpfkusacimwkcsosbicdxkicaua 1844

16094 1cf20fd6206d7678d5bcafd28c53b4db foosdfpfkusacimwkcsosbicdxkicaua 131

16095 563dde550fd624d7352f3de77c0cdfcd foosdfpfkusacimwkcsosbicdxkicaua 8730

16096 rows × 38 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 17/26

05/01/2021 BCG INTERNSHIP TASK 2

In [31]: # Channell sales has no more missing values.

powerco_data_merge.isnull().sum()

Out[31]: id 0
channel_sales 0
cons_12m 0
cons_gas_12m 0
cons_last_month 0
date_activ 0
date_end 2
date_first_activ 12588
date_modif_prod 157
date_renewal 40
forecast_base_bill_ele 12588
forecast_base_bill_year 12588
forecast_bill_12m 12588
forecast_cons 12588
forecast_cons_12m 0
forecast_cons_year 0
forecast_discount_energy 126
forecast_meter_rent_12m 0
forecast_price_energy_p1 126
forecast_price_energy_p2 126
forecast_price_pow_p1 126
has_gas 0
imp_cons 0
margin_gross_pow_ele 13
margin_net_pow_ele 13
nb_prod_act 0
net_margin 15
num_years_antig 0
origin_up 87
pow_max 3
churn 0
price_date 0
price_p1_var 36
price_p2_var 36
price_p3_var 36
price_p1_fix 36
price_p2_fix 36
price_p3_fix 36
dtype: int64

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 18/26

05/01/2021 BCG INTERNSHIP TASK 2

In [32]: # lets replace the all the numeric series with their mean

mean_price_p1_var = powerco_data_merge["price_p1_var"].sum()/16096

powerco_data_merge["price_p1_var"]=powerco_data_merge["price_p1_var"].fillna(m
ean_price_p1_var)

mean_price_p2_var = powerco_data_merge["price_p2_var"].sum()/16096

powerco_data_merge["price_p2_var"]=powerco_data_merge["price_p2_var"].fillna(m
ean_price_p2_var)

mean_price_p3_var = powerco_data_merge["price_p3_var"].sum()/16096

powerco_data_merge["price_p3_var"]=powerco_data_merge["price_p3_var"].fillna(m
ean_price_p3_var)

# p1_fix

mean_price_p1_fix = powerco_data_merge["price_p1_fix"].sum()/16096

powerco_data_merge["price_p1_fix"]=powerco_data_merge["price_p1_fix"].fillna(m
ean_price_p1_fix )

# p2_fix

mean_price_p2_fix = powerco_data_merge["price_p2_fix"].sum()/16096

powerco_data_merge["price_p2_fix"]=powerco_data_merge["price_p2_fix"].fillna(m
ean_price_p2_fix )

mean_price_p3_fix = powerco_data_merge["price_p3_fix"].sum()/16096

powerco_data_merge["price_p3_fix"]=powerco_data_merge["price_p3_fix"].fillna(m
ean_price_p3_fix)

# price_p3_fix

mean_price_p3_fix = powerco_data_merge["price_p3_fix"].sum()/16096

powerco_data_merge["price_p3_fix"]=powerco_data_merge["price_p3_fix"].fillna(m
ean_price_p3_fix)

# margin_gross_pow_ele

mean_margin_gross_pow_ele = powerco_data_merge["margin_gross_pow_ele"].sum()/
16096

powerco_data_merge["margin_gross_pow_ele"]=powerco_data_merge["margin_gross_po
w_ele"].fillna(mean_margin_gross_pow_ele)

# margin_net_pow_ele

mean_margin_net_pow_ele = powerco_data_merge["margin_net_pow_ele"].sum()/160

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 19/26

05/01/2021 BCG INTERNSHIP TASK 2

powerco_data_merge["margin_net_pow_ele"]=powerco_data_merge["margin_net_pow_el
e"].fillna(mean_margin_net_pow_ele)

# net_margin

mean_net_margin = powerco_data_merge["net_margin"].sum()/16096

powerco_data_merge["net_margin"]=powerco_data_merge["net_margin"].fillna(mean_
net_margin)

# pow_max

mean_pow_max = powerco_data_merge["pow_max"].sum()/16096

powerco_data_merge["pow_max"]=powerco_data_merge["pow_max"].fillna(mean_pow_ma
x)

# origin_up

powerco_data_merge["origin_up"]= powerco_data_merge["origin_up"].fillna(method
="pad")

# forecast_base_bill_ele

powerco_data_merge["forecast_base_bill_ele"]= powerco_data_merge["forecast_bas
e_bill_ele"].fillna(value=0)

# forecast_base_bill_year

powerco_data_merge["forecast_base_bill_year"]= powerco_data_merge["forecast_ba
se_bill_year"].fillna(value=0)

# forecast_bill_12m
powerco_data_merge["forecast_bill_12m"]= powerco_data_merge["forecast_bill_12
m"].fillna(value=0)

#forecast_cons
powerco_data_merge["forecast_cons"]= powerco_data_merge["forecast_cons"].filln
a(value=0)

# forecast_discount_energy

mean_forecast_discount_energy = powerco_data_merge["forecast_discount_energ
y"].sum()/16096

powerco_data_merge["forecast_discount_energy"]= powerco_data_merge["forecast_d
iscount_energy"].fillna(mean_forecast_discount_energy )

# forecast_price_energy_p1

mean_forecast_price_energy_p1 = powerco_data_merge["forecast_price_energy_p
1"].sum()/16096

powerco_data_merge["forecast_price_energy_p1"]= powerco_data_merge["forecast_p
rice_energy_p1"].fillna(mean_forecast_price_energy_p1)

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 20/26

05/01/2021 BCG INTERNSHIP TASK 2

# forecast_price_energy_p2

mean_forecast_price_energy_p2 = powerco_data_merge["forecast_price_energy_
p2"].sum()/16096

powerco_data_merge["forecast_price_energy_p2"]= powerco_data_merge["forecast_p
rice_energy_p2"].fillna(mean_forecast_price_energy_p2)

# forecast_price_pow_p1

mean_forecast_price_pow_p1 = powerco_data_merge["forecast_price_pow_p1"].
sum()/16096

powerco_data_merge["forecast_price_pow_p1"]= powerco_data_merge["forecast_pric
e_pow_p1"].fillna(mean_forecast_price_energy_p2)

powerco_data_merge.isnull().sum()

powerco_data_merge.info()

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 21/26

05/01/2021 BCG INTERNSHIP TASK 2

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16096 entries, 0 to 16095
Data columns (total 38 columns):
id 16096 non-null object
channel_sales 16096 non-null object
cons_12m 16096 non-null int64
cons_gas_12m 16096 non-null int64
cons_last_month 16096 non-null int64
date_activ 16096 non-null object
date_end 16094 non-null object
date_first_activ 3508 non-null object
date_modif_prod 15939 non-null object
date_renewal 16056 non-null object
forecast_base_bill_ele 16096 non-null float64
forecast_base_bill_year 16096 non-null float64
forecast_bill_12m 16096 non-null float64
forecast_cons 16096 non-null float64
forecast_cons_12m 16096 non-null float64
forecast_cons_year 16096 non-null int64
forecast_discount_energy 16096 non-null float64
forecast_meter_rent_12m 16096 non-null float64
forecast_price_energy_p1 16096 non-null float64
forecast_price_energy_p2 16096 non-null float64
forecast_price_pow_p1 16096 non-null float64
has_gas 16096 non-null object
imp_cons 16096 non-null float64
margin_gross_pow_ele 16096 non-null float64
margin_net_pow_ele 16096 non-null float64
nb_prod_act 16096 non-null int64
net_margin 16096 non-null float64
num_years_antig 16096 non-null int64
origin_up 16096 non-null object
pow_max 16096 non-null float64
churn 16096 non-null int64
price_date 16096 non-null object
price_p1_var 16096 non-null float64
price_p2_var 16096 non-null float64
price_p3_var 16096 non-null float64
price_p1_fix 16096 non-null float64
price_p2_fix 16096 non-null float64
price_p3_fix 16096 non-null float64
dtypes: float64(21), int64(7), object(10)
memory usage: 4.8+ MB

Dealing with Categorical Variable.

In [33]: # Column has_gas is a Categorical variable that needs to be converted to integ

er from string.

powerco_data_merge["has_gas"].value_counts()

Out[33]: f 13132
t 2964
Name: has_gas, dtype: int64

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 22/26

05/01/2021 BCG INTERNSHIP TASK 2

In [34]: powerco_data_merge["has_gas"] = pd.get_dummies(powerco_data_merge["has_gas"])

In [35]: powerco_data_merge["has_gas"]

Out[35]: 0 1
1 0
2 1
3 1
4 1
..
16091 0
16092 1
16093 1
16094 1
16095 1
Name: has_gas, Length: 16096, dtype: uint8

Exploratory Data Analysis

In [36]: powerco_data_merge

Out[36]:
id channel_sales cons_12m cons_ga

0 48ada52261e7cf58715202705a0451c9 lmkebamcaaclubfxadlmueccxoimlema 309275

1 24011ae4ebbe3035111d65fa7c15bc57 foosdfpfkusacimwkcsosbicdxkicaua 0

2 d29c2c54acc38ff3c0614d0a653813dd foosdfpfkusacimwkcsosbicdxkicaua 4660

3 764c75f661154dac3a6c254cd082ea7d foosdfpfkusacimwkcsosbicdxkicaua 544

4 bba03439a292a1e166f80264c16191cb lmkebamcaaclubfxadlmueccxoimlema 1584

... ... ... ...

16091 18463073fb097fc0ac5d3e040f356987 foosdfpfkusacimwkcsosbicdxkicaua 32270

16092 d0a6f71671571ed83b2645d23af6de00 foosdfpfkusacimwkcsosbicdxkicaua 7223

16093 10e6828ddd62cbcf687cb74928c4c2d2 foosdfpfkusacimwkcsosbicdxkicaua 1844

16094 1cf20fd6206d7678d5bcafd28c53b4db foosdfpfkusacimwkcsosbicdxkicaua 131

16095 563dde550fd624d7352f3de77c0cdfcd foosdfpfkusacimwkcsosbicdxkicaua 8730

16096 rows × 38 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 23/26

05/01/2021 BCG INTERNSHIP TASK 2

In [37]: powerco_data_merge.corr()

Out[37]:
cons_12m cons_gas_12m cons_last_month forecast_base_bill_ele forec

cons_12m 1.000000 0.471233 0.919545 0.073093

cons_gas_12m 0.471233 1.000000 0.447209 0.081878

cons_last_month 0.919545 0.447209 1.000000 0.067431

forecast_base_bill_ele 0.073093 0.081878 0.067431 1.000000

forecast_base_bill_year 0.073093 0.081878 0.067431 1.000000

forecast_bill_12m 0.078228 0.084076 0.064444 0.833342

forecast_cons 0.073653 0.074400 0.068190 0.968847

forecast_cons_12m 0.165168 0.059525 0.129574 0.318761

forecast_cons_year 0.139526 0.057619 0.151476 0.363838

forecast_discount_energy -0.043551 -0.014410 -0.037699 0.011114

forecast_meter_rent_12m 0.085996 0.040327 0.076066 0.214090

forecast_price_energy_p1 -0.033440 -0.021711 -0.024193 -0.127930

forecast_price_energy_p2 0.146225 0.075606 0.122923 0.142913

forecast_price_pow_p1 -0.022942 -0.038159 -0.015738 0.029144

has_gas -0.229761 -0.372771 -0.202702 -0.042248

imp_cons 0.139353 0.060609 0.153861 0.382069

margin_gross_pow_ele -0.065186 -0.016866 -0.054069 -0.032531

margin_net_pow_ele -0.045560 -0.008242 -0.037665 -0.021195

nb_prod_act 0.308567 0.272005 0.350711 0.000761

net_margin 0.119908 0.058928 0.096343 0.301896

num_years_antig 0.008810 -0.008626 0.004860 -0.071432

pow_max 0.102422 0.052365 0.089565 0.274834

churn -0.051759 -0.040880 -0.046931 0.038639

price_p1_var -0.016911 -0.012686 -0.009622 -0.119629

price_p2_var 0.141069 0.077770 0.118864 0.137585

price_p3_var 0.059166 0.047497 0.046869 0.162319

price_p1_fix -0.010066 -0.019750 -0.008352 0.050983

price_p2_fix 0.063390 0.042713 0.049753 0.184279

price_p3_fix 0.069990 0.054606 0.055159 0.146502

29 rows × 29 columns

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 24/26

05/01/2021 BCG INTERNSHIP TASK 2

In [38]: # Based on the correlation figures, a strong correlation occurs between foreca
st_base_bill_ele and forecast_base_bill_year

sns.scatterplot(x="forecast_base_bill_ele", y="forecast_base_bill_year", data=

powerco_data_merge)

Out[38]: <matplotlib.axes._subplots.AxesSubplot at 0x25a4830c6c8>

In [39]: # Based on the correlation figures, a strong correlation occurs between foreca
st_base_bill_ele and forecast_base_bill_year

sns.scatterplot(x="forecast_cons", y="forecast_base_bill_year", data=powerco_d

ata_merge)

Out[39]: <matplotlib.axes._subplots.AxesSubplot at 0x25a487077c8>

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 25/26

05/01/2021 BCG INTERNSHIP TASK 2

In [40]: # Number of clients who also have gas

sns.countplot(x="has_gas",data=powerco_data_merge)

Out[40]: <matplotlib.axes._subplots.AxesSubplot at 0x25a488ea5c8>

In [41]: # Number of customers who churn

sns.countplot(x="churn",data=powerco_data_merge)

Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x25a4892d108>

In [ ]:

localhost:8888/nbconvert/html/BCG INTERNSHIP TASK 2.ipynb?download=false 26/26

Aosdijfpqoiew
No ratings yet
Aosdijfpqoiew
6 pages
Uponor Programator I 36 en
No ratings yet
Uponor Programator I 36 en
7 pages
Task 2 Exploratory Data Analysis
No ratings yet
Task 2 Exploratory Data Analysis
5 pages
Task2 Eda Cleaning
No ratings yet
Task2 Eda Cleaning
33 pages
Customer Segmentation 1683225943
No ratings yet
Customer Segmentation 1683225943
34 pages
Customer Churn Prediction
No ratings yet
Customer Churn Prediction
16 pages
DM Project
No ratings yet
DM Project
34 pages
DM Project
No ratings yet
DM Project
36 pages
Customer Segmentation Clustering
No ratings yet
Customer Segmentation Clustering
35 pages
Sunbase Data Assignment
No ratings yet
Sunbase Data Assignment
11 pages
Exp 3
No ratings yet
Exp 3
10 pages
GRL - EX - 4 (1) .Ipynb - Colaboratory
No ratings yet
GRL - EX - 4 (1) .Ipynb - Colaboratory
7 pages
Customer Churn Analysis 1740361695
No ratings yet
Customer Churn Analysis 1740361695
14 pages
Data Cleaning
No ratings yet
Data Cleaning
13 pages
E-Commerce Product Delivery Prediction
No ratings yet
E-Commerce Product Delivery Prediction
13 pages
Data Mining Project - Brahma Chari
No ratings yet
Data Mining Project - Brahma Chari
23 pages
Flipkart Business Analyst Interview Questions
No ratings yet
Flipkart Business Analyst Interview Questions
16 pages
Analyzing Data Using Python - Cleaning and Analyzing Data in Pandas
No ratings yet
Analyzing Data Using Python - Cleaning and Analyzing Data in Pandas
81 pages
DSBDAL
No ratings yet
DSBDAL
87 pages
Data Science Tutorial 1686911993
No ratings yet
Data Science Tutorial 1686911993
41 pages
Telco Churn Analysis
No ratings yet
Telco Churn Analysis
9 pages
KPMG - Task 1
No ratings yet
KPMG - Task 1
22 pages
chip_analysis
No ratings yet
chip_analysis
2 pages
Supervised Learning Project - Ipynb - Colab
No ratings yet
Supervised Learning Project - Ipynb - Colab
14 pages
ML 5
No ratings yet
ML 5
11 pages
Guides
No ratings yet
Guides
23 pages
PMT2 24
No ratings yet
PMT2 24
56 pages
ML Lab Manual 1-10
No ratings yet
ML Lab Manual 1-10
58 pages
Using Big - Fact Customer Table With Proper Unseen Data - Colab
No ratings yet
Using Big - Fact Customer Table With Proper Unseen Data - Colab
21 pages
Exploratry Data Analysis of The Telecom Customer Churn
No ratings yet
Exploratry Data Analysis of The Telecom Customer Churn
16 pages
Data Preprocessing
No ratings yet
Data Preprocessing
12 pages
10) Merging Dataframes: # Detecting Duplicates
No ratings yet
10) Merging Dataframes: # Detecting Duplicates
7 pages
Danmairo - Analysis - Ipynb - Colaboratory
No ratings yet
Danmairo - Analysis - Ipynb - Colaboratory
18 pages
AI10
No ratings yet
AI10
2 pages
Data Cleaning in Machine Learning With Numerical Example
No ratings yet
Data Cleaning in Machine Learning With Numerical Example
3 pages
Exploratory Data Analysis BCG - Ipynb
No ratings yet
Exploratory Data Analysis BCG - Ipynb
273 pages
Vertopal.com Untitled
No ratings yet
Vertopal.com Untitled
31 pages
ML Practical 4D
No ratings yet
ML Practical 4D
11 pages
Exploratory Data Analysis BCG - Ipynb
No ratings yet
Exploratory Data Analysis BCG - Ipynb
260 pages
Ayush
No ratings yet
Ayush
23 pages
ch4 Slides PDF
No ratings yet
ch4 Slides PDF
44 pages
Data Mining - Project
100% (2)
Data Mining - Project
11 pages
Churn Prediction Model
No ratings yet
Churn Prediction Model
36 pages
6 Data Preprocessing
No ratings yet
6 Data Preprocessing
37 pages
Practicals IP-12 1-4
No ratings yet
Practicals IP-12 1-4
9 pages
Unit II - Data Science
No ratings yet
Unit II - Data Science
113 pages
Untitled0.ipynb - Colab
No ratings yet
Untitled0.ipynb - Colab
6 pages
Analysis
No ratings yet
Analysis
37 pages
Implement K-Means Clustering.: Preprocessing
No ratings yet
Implement K-Means Clustering.: Preprocessing
8 pages
IP Project File CYBER CAFE MANAGEMENT
No ratings yet
IP Project File CYBER CAFE MANAGEMENT
27 pages
AML Project LearnerNotebook LowCode
No ratings yet
AML Project LearnerNotebook LowCode
74 pages
Task 1 Vijaya Lakshman PDF
No ratings yet
Task 1 Vijaya Lakshman PDF
10 pages
Clean Code in Python
50% (2)
Clean Code in Python
35 pages
Customer Churn Analysis - Jupyter Notebook
No ratings yet
Customer Churn Analysis - Jupyter Notebook
10 pages
B Tech-AIML-question Bank-2 Answer Key
No ratings yet
B Tech-AIML-question Bank-2 Answer Key
9 pages
Overview of Data Cleaning
No ratings yet
Overview of Data Cleaning
17 pages
Rittik Kumar Naskar
No ratings yet
Rittik Kumar Naskar
19 pages
Microsoft NAV Interview Questions: Unofficial Microsoft Navision Business Solution Certification Review
From Everand
Microsoft NAV Interview Questions: Unofficial Microsoft Navision Business Solution Certification Review
Equity Press
1/5 (1)
Data Mining Models: Techniques and Applications
From Everand
Data Mining Models: Techniques and Applications
Ravi Deshpande
No ratings yet
Effective Analytics for Marketing
From Everand
Effective Analytics for Marketing
Sucheta Kakkar
No ratings yet
Stripe Payment Integration for Beginners: A Practical Guide to Accepting Payments Online
From Everand
Stripe Payment Integration for Beginners: A Practical Guide to Accepting Payments Online
Steven Mcananey
No ratings yet
US Coast Guard Auxiliary Boat Crew Program Mentor Guide
No ratings yet
US Coast Guard Auxiliary Boat Crew Program Mentor Guide
5 pages
EIM Demo
No ratings yet
EIM Demo
5 pages
D'EVREUX, Yves. Viagem Ao Norte Do Brasil Feita Nos Annos de 1613 A 1614. 1874 PDF
No ratings yet
D'EVREUX, Yves. Viagem Ao Norte Do Brasil Feita Nos Annos de 1613 A 1614. 1874 PDF
494 pages
William Carroll and Aquinas
100% (1)
William Carroll and Aquinas
17 pages
L-3WarriorSystems 2013 ProductGuide DigitalEdition
100% (1)
L-3WarriorSystems 2013 ProductGuide DigitalEdition
1 page
Elx DD Nic 5.00.31.01-6 Windows 32-64
No ratings yet
Elx DD Nic 5.00.31.01-6 Windows 32-64
4 pages
Adobe Scan 28-Dec-2023
No ratings yet
Adobe Scan 28-Dec-2023
5 pages
Windows Management Instrumentation (WMI) : Lecture (20 Jan)
No ratings yet
Windows Management Instrumentation (WMI) : Lecture (20 Jan)
3 pages
Rlpyt: A Research Code Base For Deep Reinforcement Learning in Pytorch
No ratings yet
Rlpyt: A Research Code Base For Deep Reinforcement Learning in Pytorch
12 pages
The Effectiveness of Using Flashcard Strategy in Teaching Vocabulary For Student in Ringinanyar Elementary School
No ratings yet
The Effectiveness of Using Flashcard Strategy in Teaching Vocabulary For Student in Ringinanyar Elementary School
9 pages
Study of Means End Value Chain Model
100% (1)
Study of Means End Value Chain Model
19 pages
Mathematics in The Modern World
No ratings yet
Mathematics in The Modern World
5 pages
The Life of Frank Lloyd Wright
100% (4)
The Life of Frank Lloyd Wright
33 pages
Contemporary Topics Unit 12 Vocab
No ratings yet
Contemporary Topics Unit 12 Vocab
2 pages
Grade 6 English1st Term Examination
100% (1)
Grade 6 English1st Term Examination
5 pages
Using A Pendulum
83% (6)
Using A Pendulum
0 pages
Corporate Farm
No ratings yet
Corporate Farm
4 pages
50 Ways To Start A Conversation With Anyone!
No ratings yet
50 Ways To Start A Conversation With Anyone!
11 pages
Standard Break Key Sequence Combinations During Password Recovery
No ratings yet
Standard Break Key Sequence Combinations During Password Recovery
8 pages
Muet Writing Introduction
100% (1)
Muet Writing Introduction
16 pages
Kling
No ratings yet
Kling
1 page
Resume - Jlk10a
No ratings yet
Resume - Jlk10a
5 pages
Mechanical Engineering Routine
No ratings yet
Mechanical Engineering Routine
2 pages
BDSP Poc Guide
No ratings yet
BDSP Poc Guide
31 pages
Johnson H Smith Prasad ABC Kumar Abc 123 Zxy Dke 392 Eiek
No ratings yet
Johnson H Smith Prasad ABC Kumar Abc 123 Zxy Dke 392 Eiek
9 pages
Icra2010 Marder Eppstein
No ratings yet
Icra2010 Marder Eppstein
8 pages
3D Printing
No ratings yet
3D Printing
26 pages
Comparative Study of The Critical Success Factors (CSFS) For Community Resilience Assessment (CRA) in Developed and Developing Countries
No ratings yet
Comparative Study of The Critical Success Factors (CSFS) For Community Resilience Assessment (CRA) in Developed and Developing Countries
13 pages
Pepperberg (1987) Summary Sheet - Study With Mehar
No ratings yet
Pepperberg (1987) Summary Sheet - Study With Mehar
3 pages