Outlier Treatment - Jupyter Notebook

The document discusses outlier treatment in a dataset using a Jupyter Notebook, focusing on the relationship between dependent and independent variables. It provides an example with a dataset containing information about individuals' age, sex, BMI, children, smoking status, region, and charges. The analysis includes steps for detecting and removing outliers based on the Interquartile Range (IQR) method, specifically for the BMI variable.

Uploaded by

saikiran31032003

9/15/23, 10:13 AM Outlier treatment - Jupyter Notebook

y is the dependent variable.

x1, x2, x3, x4, ... are the independent variables.

a0 = constant (intercept).

ai = slope coefficient of the independent variable xi.

y = a0 + a1*x1 + a2*x2 + a3*x3 + ... + ai*xi

Example: salary is the dependent variable; designation and experience are the independent variables.

Likewise, the price of a house depends on its size, number of BHK, location, and transport access.

The independent variables should not be highly correlated with each other.
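As a rough illustration of the model above, the following sketch fits a0, a1 and a2 by ordinary least squares with NumPy. The data are synthetic and the names `true_a0`, `true_a`, `X` and `A` are made up for this example; nothing here comes from insurance.csv.

```python
import numpy as np

# Hypothetical toy data: two independent variables (x1, x2) and a noiseless target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_a0, true_a = 3.0, np.array([2.0, -1.5])
y = true_a0 + X @ true_a  # y = a0 + a1*x1 + a2*x2

# Prepend a column of ones so the first fitted coefficient is the intercept a0.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a0, a1, a2 = coef
print(a0, a1, a2)
```

Because the toy target has no noise, the fitted coefficients should recover the values used to generate it, up to floating-point error.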

In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]: df = pd.read_csv("insurance.csv")

In [3]: df

Out[3]:
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1338 rows × 7 columns

localhost:8888/notebooks/SMARTINTERN/Outlier treatment.ipynb 1/15



In [4]: df.head()

Out[4]:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

In [5]: df.shape

Out[5]: (1338, 7)

In [6]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

In [7]: df.describe()

Out[7]:
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010


In [8]: df.isnull().any()

Out[8]: age False
sex False
bmi False
children False
smoker False
region False
charges False
dtype: bool

In [9]: df.isnull().sum()

Out[9]: age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64

In [10]: cor=df.corr(numeric_only=True) # correlate the numeric columns only; sex, smoker and region are text

In [11]: cor

Out[11]:
               age       bmi  children   charges
age       1.000000  0.109272  0.042469  0.299008
bmi       0.109272  1.000000  0.012759  0.198341
children  0.042469  0.012759  1.000000  0.067998
charges   0.299008  0.198341  0.067998  1.000000
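The correlation matrix can be used to check the earlier requirement that the independent variables are not highly correlated with each other. The sketch below re-enters the age, bmi and children values printed in Out[11] and looks at the largest off-diagonal entry. The 0.8 cutoff and the names `cor_pred` and `THRESHOLD` are assumptions made for illustration, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Pairwise correlations of the numeric independent variables, copied from Out[11].
cols = ["age", "bmi", "children"]
cor_pred = pd.DataFrame(
    [[1.000000, 0.109272, 0.042469],
     [0.109272, 1.000000, 0.012759],
     [0.042469, 0.012759, 1.000000]],
    index=cols, columns=cols,
)

THRESHOLD = 0.8  # assumed cutoff for "highly correlated"

# Blank out the diagonal (each variable correlates perfectly with itself).
off_diag = cor_pred.where(~np.eye(len(cols), dtype=bool))
max_abs = off_diag.abs().max().max()
highly_correlated = max_abs > THRESHOLD
print(max_abs, highly_correlated)
```

With the values shown, the strongest pairwise correlation is between age and bmi and is far below the assumed cutoff.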

In [12]: sns.heatmap(cor) # the correlation matrix is the only data argument heatmap needs

Out[12]: <AxesSubplot:>


OUTLIER DETECTION
In [13]: sns.boxplot(x=df.bmi) # pass the column as x=; positional use is deprecated in seaborn

Out[13]: <AxesSubplot:xlabel='bmi'>

Outlier treatment by replacement with the median:


In [16]: q1 = df.bmi.quantile(0.25)
q3 = df.bmi.quantile(0.75)

In [17]: q1

Out[17]: 26.29625

In [18]: q3

Out[18]: 34.69375

In [19]: IQR = q3-q1

In [20]: IQR

Out[20]: 8.3975


In [21]: upper_limit = q3+1.5*IQR

In [22]: upper_limit

Out[22]: 47.290000000000006

In [23]: lower_limit = q1-1.5*IQR

In [24]: lower_limit

Out[24]: 13.7
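The limit calculation above can be wrapped in a small helper. This is only a sketch: `iqr_limits` is a hypothetical name, and the quartiles, minimum and maximum are the bmi figures printed earlier in the notebook (Out[7], Out[17], Out[18]) rather than values read from the file.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return (lower_limit, upper_limit) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles, minimum and maximum of bmi as printed earlier in the notebook.
q1, q3 = 26.29625, 34.69375
bmi_min, bmi_max = 15.96, 53.13

lower_limit, upper_limit = iqr_limits(q1, q3)

# A value is flagged as an outlier when it falls outside [lower_limit, upper_limit].
has_low_outliers = bmi_min < lower_limit
has_high_outliers = bmi_max > upper_limit
print(lower_limit, upper_limit, has_low_outliers, has_high_outliers)
```

Comparing the limits with the column minimum and maximum shows on which side outliers exist before any rows are touched.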

In [25]: # The boxplot shows outliers only above the upper limit (the minimum bmi, 15.96,
# is above lower_limit), so only upper_limit is used.

In [29]: med = df.median(numeric_only=True) # medians of the numeric columns only

In [31]: mm = df["bmi"].median()
mm

Out[31]: 30.4

In [32]: df["bmi"]=np.where(df["bmi"]>upper_limit,mm,df["bmi"])


In [33]: sns.boxplot(x=df.bmi) # bmi after replacing values above upper_limit with the median

Out[33]: <AxesSubplot:xlabel='bmi'>

In [34]: df.shape

Out[34]: (1338, 7)
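The replacement step can be tried on a small series to see why the shape does not change. The values in `s` are hypothetical, and `treated` is a name introduced only for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical bmi-like values with one obvious high outlier.
s = pd.Series([22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 35.0, 60.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
med = s.median()  # computed before any value is replaced

# Values above the upper limit become the median; all other values are kept as they are.
treated = pd.Series(np.where(s > upper, med, s), index=s.index)
print(upper, med, treated.tolist())
```

Every row is kept, so the row count is the same before and after; only the flagged values change.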

Removal Method :


In [35]: sns.boxplot(x=df.bmi)

Out[35]: <AxesSubplot:xlabel='bmi'>

In [36]: dff = pd.read_csv("insurance.csv")

In [37]: dff

Out[37]:
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1338 rows × 7 columns


In [38]: sns.boxplot(x=dff.bmi)

Out[38]: <AxesSubplot:xlabel='bmi'>

In [39]: q1 = dff.bmi.quantile(0.25)
q3 = dff.bmi.quantile(0.75)

In [42]: q1

Out[42]: 26.29625

In [43]: q3

Out[43]: 34.69375

In [44]: IQR = q3-q1

In [45]: IQR

Out[45]: 8.3975

In [46]: upper_limit = q3+1.5*IQR

In [47]: upper_limit

Out[47]: 47.290000000000006

In [48]: lower_limit = q1-1.5*IQR


In [49]: lower_limit

Out[49]: 13.7

In [50]: dff = dff[dff.bmi<upper_limit]

In [51]: dff

Out[51]:
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1329 rows × 7 columns


In [52]: sns.boxplot(x=dff.bmi) # bmi after dropping rows above upper_limit

Out[52]: <AxesSubplot:xlabel='bmi'>

In [53]: dff.shape

Out[53]: (1329, 7)
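The removal approach drops whole rows instead of editing values, which is why the shape shrinks from 1338 to 1329 rows. Below is a minimal sketch on a hypothetical frame `toy`. It filters on both limits with `Series.between`, which is inclusive at both ends, whereas the notebook uses a strict `<` on the upper limit only.

```python
import pandas as pd

# Hypothetical frame with one high outlier in the bmi column.
toy = pd.DataFrame({"bmi": [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 35.0, 60.0]})

q1 = toy["bmi"].quantile(0.25)
q3 = toy["bmi"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows whose bmi lies inside [lower, upper]; the other rows are dropped.
mask = toy["bmi"].between(lower, upper)
trimmed = toy[mask].reset_index(drop=True)
removed = len(toy) - len(trimmed)
print(lower, upper, removed)
```

Resetting the index is optional; the notebook keeps the original index, which is why row labels up to 1337 still appear after filtering.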

Z-Score method :
In [54]: dfff = pd.read_csv("insurance.csv")


In [55]: dfff

Out[55]:
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1338 rows × 7 columns

In [57]: sns.boxplot(x=dfff.bmi)

Out[57]: <AxesSubplot:xlabel='bmi'>

In [59]: from scipy import stats

In [60]: bmi_zscore=stats.zscore(dfff.bmi)


In [61]: bmi_zscore

Out[61]: 0 -0.453320
1 0.509621
2 0.383307
3 -1.305531
4 -0.292556
...
1333 0.050297
1334 0.206139
1335 1.014878
1336 -0.797813
1337 -0.261388
Name: bmi, Length: 1338, dtype: float64
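A z-score is the distance of a value from the mean, measured in standard deviations. The sketch below computes it by hand with NumPy on a hypothetical sample, using the population standard deviation (ddof=0), which is also the default of `scipy.stats.zscore`. As a cross-check it rebuilds the first value of Out[61] from the mean and std printed by `df.describe()`; describe reports the sample std (ddof=1), so it is rescaled first. The names `values`, `keep` and `z_first` are introduced only for this example.

```python
import math
import numpy as np

# Hypothetical sample: 20 evenly spaced values plus one extreme value.
values = np.append(np.linspace(25.0, 35.0, 20), 90.0)

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (values - values.mean()) / values.std()
keep = np.abs(z) <= 3  # keep values within 3 standard deviations of the mean
filtered = values[keep]

# Cross-check against Out[61], using n, mean and std of bmi from df.describe().
n, mean, sample_std = 1338, 30.663397, 6.098187
population_std = sample_std * math.sqrt((n - 1) / n)  # convert ddof=1 to ddof=0
z_first = (27.900 - mean) / population_std  # bmi of row 0
print(len(filtered), round(z_first, 6))
```

Only the extreme value exceeds the cutoff in the toy sample, and the hand-computed score for row 0 agrees with the printed one to the displayed precision.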

In [62]: df_z=dfff[np.abs(bmi_zscore)<=3] # for z-scores, 3 is the usual cutoff (normal distribution)

In [63]: df_z

Out[63]:
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1334 rows × 7 columns


In [65]: sns.boxplot(x=df_z.bmi)

Out[65]: <AxesSubplot:xlabel='bmi'>

Percentile :
In [66]: p99 = df.bmi.quantile(0.99)

In [67]: p99

Out[67]: 44.72834999999999

In [68]: df5 = pd.read_csv("insurance.csv")

In [69]: df5

Out[69]:
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1338 rows × 7 columns


In [70]: p99 = df5.bmi.quantile(0.99)

In [71]: p99

Out[71]: 46.40789999999996
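The percentile method uses a chosen quantile of the column itself as the cutoff. Here is a small sketch on hypothetical data (`s` and `kept` are made-up names), relying on the default linear interpolation of `Series.quantile`. Note that the comparison has to use the computed p99 value, not the number 99.

```python
import numpy as np
import pandas as pd

# Hypothetical column: the 200 values 0.0, 1.0, ..., 199.0.
s = pd.Series(np.arange(200.0))

# With linear interpolation the 99th percentile sits at position 0.99 * (n - 1) = 197.01.
p99 = s.quantile(0.99)

# Keep rows at or below the cutoff; roughly the top 1% of rows is dropped.
kept = s[s <= p99]
removed = len(s) - len(kept)
print(p99, removed)
```

The cutoff depends on the data it is computed from, which is why Out[67] and Out[71] differ: the first was computed on the frame whose high bmi values had already been replaced by the median.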

In [72]: df6 = df5[df5.bmi<=p99] # keep rows at or below the 99th percentile computed above

In [73]: df6

Out[73]:
      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1324 rows × 7 columns

In [75]: sns.boxplot(x=df6.bmi)

Out[75]: <AxesSubplot:xlabel='bmi'>
