Outlier Treatment - Jupyter Notebook
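y is the dependent variable.
a0 is the constant (intercept); a1 ... ai are the coefficients of the independent variables x1 ... xi.
y = a0 + a1*x1 + a2*x2 + a3*x3 + ... + ai*xi
For example, the price of a house depends on its size, number of BHK, location, and transport access.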
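The linear model above can be evaluated directly once the coefficients are known. The following is a minimal sketch only: the intercept, the coefficients, and the three numeric features (size, number of BHK, distance to transport) are hypothetical values chosen for illustration, not estimates from any dataset.

```python
import numpy as np

# Hypothetical intercept a0 and coefficients a1..a3 (illustrative values only)
a0 = 50.0
coefficients = np.array([0.125, 10.0, -2.0])

# Hypothetical features x1..x3: size in sq ft, number of BHK, distance to transport in km
x = np.array([1200.0, 3.0, 5.0])

# y = a0 + a1*x1 + a2*x2 + a3*x3
y = a0 + coefficients @ x
print(y)
```

Each coefficient scales one feature, and the intercept is the predicted value when every feature is zero.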
In [2]: df = pd.read_csv("insurance.csv")
In [3]: df
In [4]: df.head()
In [5]: df.shape
Out[5]: (1338, 7)
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
In [7]: df.describe()
In [8]: df.isnull().any()
In [9]: df.isnull().sum()
Out[9]: age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
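The two checks above report no missing values for this dataset. As a small illustration of what they return when values are missing, here is a sketch on a made-up frame; the name `demo` and its contents are assumptions for this example and are not part of insurance.csv.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing values (hypothetical data)
demo = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, np.nan, 31.2],
    "smoker": ["yes", "no", None],
})

# Number of missing entries in each column
missing_per_column = demo.isnull().sum()

# True for each column that contains at least one missing entry
has_missing = demo.isnull().any()

print(missing_per_column)
print(has_missing)
```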
In [10]: cor = df.select_dtypes(include="number").corr()
In [11]: cor
In [12]: sns.heatmap(cor)
Out[12]: <AxesSubplot:>
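Pearson correlation is only defined for numeric columns, so text columns such as sex, smoker, and region do not appear in the correlation matrix. The sketch below shows the numeric-only selection on a small made-up frame; the name `demo` and every value in it are assumptions for illustration.

```python
import pandas as pd

# Toy frame mixing numeric and text columns (hypothetical values)
demo = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "sex": ["female", "male", "male", "female", "female", "male"],
    "bmi": [27.9, 33.8, 22.7, 28.9, 30.8, 25.7],
    "charges": [1700.0, 2200.0, 4400.0, 6300.0, 10800.0, 13000.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = demo.select_dtypes(include="number")
cor_demo = numeric.corr()
print(cor_demo)
```

Selecting the numeric columns explicitly avoids relying on pandas to drop text columns, which newer pandas versions no longer do silently.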
OUTLIER DETECTION
In [13]: sns.boxplot(x=df.bmi)
Out[13]: <AxesSubplot:xlabel='bmi'>
In [17]: q1
Out[17]: 26.29625
In [18]: q3
Out[18]: 34.69375
In [20]: IQR
Out[20]: 8.3975
In [22]: upper_limit
Out[22]: 47.290000000000006
In [24]: lower_limit
Out[24]: 13.7
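The cells that assign q1, q3, IQR, upper_limit, and lower_limit are not shown above. The printed values are consistent with the conventional 1.5 × IQR rule: 34.69375 - 26.29625 = 8.3975, 34.69375 + 1.5 × 8.3975 = 47.29, and 26.29625 - 1.5 × 8.3975 = 13.7. The following sketch applies that rule to a small synthetic series; the data and the name `outliers` are assumptions for illustration.

```python
import pandas as pd

# Synthetic BMI values (made up for illustration; not taken from insurance.csv)
bmi = pd.Series([18.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 60.0], name="bmi")

# First and third quartiles
q1 = bmi.quantile(0.25)
q3 = bmi.quantile(0.75)

# Interquartile range and the 1.5 * IQR fences
IQR = q3 - q1
upper_limit = q3 + 1.5 * IQR
lower_limit = q1 - 1.5 * IQR

# Values outside the fences are flagged as outliers
outliers = bmi[(bmi > upper_limit) | (bmi < lower_limit)]
print(q1, q3, IQR, upper_limit, lower_limit)
print(outliers)
```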
In [25]: # The outliers lie only above the upper limit, so we only need
# to consider upper_limit.
/var/folders/hl/c5h7c5j13bzcnhyv6780pcmc0000gn/T/ipykernel_1006/2908337485.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
med = df.median()
In [31]: mm = df["bmi"].median()
mm
Out[31]: 30.4
In [32]: df["bmi"] = np.where(df["bmi"] > upper_limit, mm, df["bmi"])
In [33]: sns.boxplot(x=df.bmi)
Out[33]: <AxesSubplot:xlabel='bmi'>
In [34]: df.shape
Out[34]: (1338, 7)
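Median replacement keeps every row and only overwrites the values above the upper fence, which is why the shape stays at (1338, 7). A minimal sketch of the same idea on synthetic data follows; `df_demo` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data with one obvious high outlier (hypothetical values)
df_demo = pd.DataFrame(
    {"bmi": [18.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 60.0]}
)
rows_before = df_demo.shape[0]

# Upper fence from the 1.5 * IQR rule
q1 = df_demo["bmi"].quantile(0.25)
q3 = df_demo["bmi"].quantile(0.75)
upper_limit = q3 + 1.5 * (q3 - q1)

# Median computed before replacement
median_bmi = df_demo["bmi"].median()

# Replace values above the fence with the median; keep all other values
df_demo["bmi"] = np.where(df_demo["bmi"] > upper_limit, median_bmi, df_demo["bmi"])
print(df_demo)
```

The row count is unchanged, so no other column loses information; the cost is that the replaced entries no longer reflect measured values.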
Removal Method:
In [35]: sns.boxplot(x=df.bmi)
Out[35]: <AxesSubplot:xlabel='bmi'>
In [37]: dff
In [38]: sns.boxplot(x=dff.bmi)
Out[38]: <AxesSubplot:xlabel='bmi'>
In [39]: q1 = dff.bmi.quantile(0.25)
q3 = dff.bmi.quantile(0.75)
In [42]: q1
Out[42]: 26.29625
In [43]: q3
Out[43]: 34.69375
In [45]: IQR
Out[45]: 8.3975
In [47]: upper_limit
Out[47]: 47.290000000000006
In [49]: lower_limit
Out[49]: 13.7
In [51]: dff
In [52]: sns.boxplot(x=dff.bmi)
Out[52]: <AxesSubplot:xlabel='bmi'>
In [53]: dff.shape
Out[53]: (1329, 7)
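The filtering cell itself is not shown above. The shape (1329, 7) is nine rows fewer than the 1338 rows in insurance.csv, which is consistent with dropping every row whose bmi falls outside the IQR fences, assuming dff was reloaded from the file. The sketch below shows one way to do that on synthetic data; `dff_demo`, `dff_clean`, and the values are assumptions for illustration.

```python
import pandas as pd

# Synthetic data with one high-BMI outlier (hypothetical values)
dff_demo = pd.DataFrame({
    "bmi": [18.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 60.0],
    "charges": [1500.0, 1800.0, 2100.0, 2500.0, 3000.0,
                3600.0, 4100.0, 4700.0, 5200.0, 9900.0],
})

# IQR fences for bmi
q1 = dff_demo["bmi"].quantile(0.25)
q3 = dff_demo["bmi"].quantile(0.75)
IQR = q3 - q1
upper_limit = q3 + 1.5 * IQR
lower_limit = q1 - 1.5 * IQR

# Keep only the rows whose bmi lies within the fences
inside = (dff_demo["bmi"] >= lower_limit) & (dff_demo["bmi"] <= upper_limit)
dff_clean = dff_demo[inside]
print(dff_clean.shape)
```

Unlike median replacement, removal drops the whole row, so the values in the other columns of those rows are lost as well.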
Z-Score Method:
In [54]: dfff = pd.read_csv("insurance.csv")
In [55]: dfff
In [57]: sns.boxplot(x=dfff.bmi)
Out[57]: <AxesSubplot:xlabel='bmi'>
In [60]: bmi_zscore=stats.zscore(dfff.bmi)
In [61]: bmi_zscore
Out[61]: 0 -0.453320
1 0.509621
2 0.383307
3 -1.305531
4 -0.292556
...
1333 0.050297
1334 0.206139
1335 1.014878
1336 -0.797813
1337 -0.261388
Name: bmi, Length: 1338, dtype: float64
In [63]: df_z
In [65]: sns.boxplot(x=df_z.bmi)
Out[65]: <AxesSubplot:xlabel='bmi'>
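The cell that builds df_z is not shown above. A common convention, assumed here, is to keep rows whose absolute z-score is below 3. The sketch computes the z-score by hand with the population standard deviation (ddof=0), which matches the default of scipy.stats.zscore; the data and the names `dfz_demo` and `df_z_demo` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: 20 ordinary values plus one extreme value (hypothetical)
values = [28.0, 29.0, 30.0, 31.0, 32.0] * 4 + [60.0]
dfz_demo = pd.DataFrame({"bmi": values})

# z = (x - mean) / population standard deviation
mean_bmi = dfz_demo["bmi"].mean()
std_bmi = dfz_demo["bmi"].std(ddof=0)
z = (dfz_demo["bmi"] - mean_bmi) / std_bmi

# Keep rows within 3 standard deviations of the mean (threshold is an assumption)
df_z_demo = dfz_demo[np.abs(z) < 3]
print(df_z_demo.shape)
```

With very small samples no point can reach |z| > 3, because a single extreme value also inflates the standard deviation, so this method needs a reasonably large dataset.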
Percentile Method:
In [66]: p99 = df.bmi.quantile(0.99)
In [67]: p99
Out[67]: 44.72834999999999
In [69]: df5
In [71]: p99
Out[71]: 46.40789999999996
In [73]: df6
In [75]: sns.boxplot(x=df6.bmi)
Out[75]: <AxesSubplot:xlabel='bmi'>
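The cells that build df5 and df6 are not shown above, and the two printed p99 values differ, which suggests they were computed on different frames. One plausible reading, assumed here, is that rows above the 99th percentile of bmi are dropped. The sketch below shows that filter on synthetic data; `dfp_demo`, `dfp_trimmed`, and the values are assumptions for illustration.

```python
import pandas as pd

# 100 evenly spaced synthetic BMI values from 20.0 to 44.75 (hypothetical)
dfp_demo = pd.DataFrame({"bmi": [20.0 + 0.25 * i for i in range(100)]})

# 99th percentile of bmi
p99 = dfp_demo["bmi"].quantile(0.99)

# Keep rows at or below the 99th percentile
dfp_trimmed = dfp_demo[dfp_demo["bmi"] <= p99]
print(p99, dfp_trimmed.shape)
```

A percentile cut always removes roughly the top 1% of rows, whether or not those rows are genuine outliers, so it is worth checking the result against a boxplot.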