Data Pre Processing 1

The document discusses outlier detection using the Tukey IQR method. It defines outliers as values below Q1-1.5(Q3-Q1) or above Q3+1.5(Q3-Q1). It also discusses removing outliers using standard deviation by eliminating values above or below the mean by 2.75 standard deviations. Code is provided that identifies and removes outliers in a diabetes dataset based on age values.

Uploaded by

Arika Putri

Outlier detection - Tukey IQR

https://towardsdatascience.com/local-outlier-factor-for-anomaly-detection-cc0c770d2ebe

Identifies extreme values in data

Outliers are defined as:

Values below Q1 - 1.5(Q3 - Q1) or above Q3 + 1.5(Q3 - Q1)

Standard deviation from the mean is another common method to detect extreme values,
but it can be problematic:

Assumes normality
Sensitive to very extreme values
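
The fence rule above can be sketched in a few lines. This is a minimal illustration on a made-up list of ages, not the notebook's data. `tukey_fences` is a hypothetical helper, and the quartiles come from `statistics.quantiles` with its inclusive method, so other quartile conventions may give slightly different fences.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages for illustration only
ages = [21, 22, 24, 25, 26, 27, 29, 31, 33, 81]
lower, upper = tukey_fences(ages)

# Anything outside the fences is flagged as an outlier
outliers = [a for a in ages if a < lower or a > upper]
print(lower, upper, outliers)
```

Because the fences are built from quartiles, a single extreme value such as 81 barely moves them, unlike a rule based on the mean and standard deviation.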

from IPython.display import Image

Image(filename='Images/tukeyiqr.jpg')

# box and whisker plot (Univariate Plots)

# with this we can determine outliers in dataset
X = pima_df.loc[:,'Pregnancies':'Age']
X.plot(kind='box',subplots=True,layout=(3,3),figsize=(15,10))
plt.show()
Why 1.5 times the width of the box for the outliers? Why does that particular value demarcate the
difference between "acceptable" and "unacceptable" values?
Because when John Tukey introduced the box-and-whisker plot to display these values (popularized in his
1977 book Exploratory Data Analysis), he picked 1.5×IQR as the demarcation line for outliers. This has
worked well, so the value has been used ever since. For bell-curve-shaped data, the fences sit about 2.7
standard deviations from the mean, so only about 0.7 percent of the data are expected to be outliers.
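
For a normal distribution, the share of data beyond the 1.5×IQR fences can be computed directly. The sketch below uses only the standard library and assumes an exactly normal population, which real data rarely is.

```python
from statistics import NormalDist

z = NormalDist()                       # standard normal: mean 0, sd 1
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Probability mass outside the two fences
frac_outside = z.cdf(lower) + (1 - z.cdf(upper))
print(round(upper, 3), round(frac_outside, 4))
```

The upper fence lands near 2.7 standard deviations, and the fraction outside both fences is a little under one percent.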

Removing Outliers Using Standard Deviation

Our approach was to remove the outlier points by eliminating any points that were above (Mean + 2.75SD)
and any points below (Mean - 2.75SD) before plotting the frequencies.

X = pima_df.loc[:,'Pregnancies':'Age']
outlier_df = X['Age'][((X['Age']-X['Age'].mean()).abs() > 2.75*X['Age'].std())]
print(outlier_df)
out_indices = outlier_df.index
print(out_indices)

import numpy as np

123 69
221 66
363 67
453 72
459 81
489 67
495 66
537 67
552 66
666 70
674 68
684 69
759 66
Name: Age, dtype: int64
Int64Index([123, 221, 363, 453, 459, 489, 495, 537, 552, 666, 674, 684, 759],
dtype='int64')

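# Blank out only the Age value of each outlier row, then impute the median.
# Selecting X.loc[out_indices] without a column label overwrites every column
# of those rows, which is what the printed output below shows for row 759.
X.loc[out_indices, 'Age'] = np.nan
print(X)
mid = X['Age'].median()   # median() skips NaN by default
X.loc[out_indices, 'Age'] = mid
print(X)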

Pregnancies Glucose BloodPressure SkinThickness Insulin \

0 6.0 148.0 72.000000 35.00000 155.548223
1 1.0 85.0 66.000000 29.00000 155.548223
2 8.0 183.0 64.000000 29.15342 155.548223
3 1.0 89.0 66.000000 23.00000 94.000000
4 0.0 137.0 40.000000 35.00000 168.000000
5 5.0 116.0 74.000000 29.15342 155.548223
6 3.0 78.0 50.000000 32.00000 88.000000
7 10.0 115.0 72.405184 29.15342 155.548223
8 2.0 197.0 70.000000 45.00000 543.000000
9 8.0 125.0 96.000000 29.15342 155.548223
10 4.0 110.0 92.000000 29.15342 155.548223
11 10.0 168.0 74.000000 29.15342 155.548223
12 10.0 139.0 80.000000 29.15342 155.548223
13 1.0 189.0 60.000000 23.00000 846.000000
14 5.0 166.0 72.000000 19.00000 175.000000
15 7.0 100.0 72.405184 29.15342 155.548223
16 0.0 118.0 84.000000 47.00000 230.000000
17 7.0 107.0 74.000000 29.15342 155.548223
18 1.0 103.0 30.000000 38.00000 83.000000
19 1.0 115.0 70.000000 30.00000 96.000000
20 3.0 126.0 88.000000 41.00000 235.000000
21 8.0 99.0 84.000000 29.15342 155.548223
22 7.0 196.0 90.000000 29.15342 155.548223
23 9.0 119.0 80.000000 35.00000 155.548223
24 11.0 143.0 94.000000 33.00000 146.000000
25 10.0 125.0 70.000000 26.00000 115.000000
26 7.0 147.0 76.000000 29.15342 155.548223
27 1.0 97.0 66.000000 15.00000 140.000000
28 13.0 145.0 82.000000 19.00000 110.000000
29 5.0 117.0 92.000000 29.15342 155.548223
.. ... ... ... ... ...
738 2.0 99.0 60.000000 17.00000 160.000000
739 1.0 102.0 74.000000 29.15342 155.548223
740 11.0 120.0 80.000000 37.00000 150.000000
741 3.0 102.0 44.000000 20.00000 94.000000
742 1.0 109.0 58.000000 18.00000 116.000000
743 9.0 140.0 94.000000 29.15342 155.548223
744 13.0 153.0 88.000000 37.00000 140.000000
745 12.0 100.0 84.000000 33.00000 105.000000
746 1.0 147.0 94.000000 41.00000 155.548223
747 1.0 81.0 74.000000 41.00000 57.000000
748 3.0 187.0 70.000000 22.00000 200.000000
749 6.0 162.0 62.000000 29.15342 155.548223
750 4.0 136.0 70.000000 29.15342 155.548223
751 1.0 121.0 78.000000 39.00000 74.000000
752 3.0 108.0 62.000000 24.00000 155.548223
753 0.0 181.0 88.000000 44.00000 510.000000
754 8.0 154.0 78.000000 32.00000 155.548223
755 1.0 128.0 88.000000 39.00000 110.000000
756 7.0 137.0 90.000000 41.00000 155.548223
757 0.0 123.0 72.000000 29.15342 155.548223
758 1.0 106.0 76.000000 29.15342 155.548223
759 NaN NaN NaN NaN NaN
760 2.0 88.0 58.000000 26.00000 16.000000
761 9.0 170.0 74.000000 31.00000 155.548223
762 9.0 89.0 62.000000 29.15342 155.548223
763 10.0 101.0 76.000000 48.00000 180.000000
764 2.0 122.0 70.000000 27.00000 155.548223
765 5.0 121.0 72.000000 23.00000 112.000000
766 1.0 126.0 60.000000 29.15342 155.548223
767 1.0 93.0 70.000000 31.00000 155.548223

BMI DiabetesPedigreeFunction Age

0 33.600000 0.627 50.0
1 26.600000 0.351 31.0
2 23.300000 0.672 32.0
3 28.100000 0.167 21.0
4 43.100000 2.288 33.0
5 25.600000 0.201 30.0
6 31.000000 0.248 26.0
7 35.300000 0.134 29.0
8 30.500000 0.158 53.0
9 32.457464 0.232 54.0
10 37.600000 0.191 30.0
11 38.000000 0.537 34.0
12 27.100000 1.441 57.0
13 30.100000 0.398 59.0
14 25.800000 0.587 51.0
15 30.000000 0.484 32.0
16 45.800000 0.551 31.0
17 29.600000 0.254 31.0
18 43.300000 0.183 33.0
19 34.600000 0.529 32.0
20 39.300000 0.704 27.0
21 35.400000 0.388 50.0
22 39.800000 0.451 41.0
23 29.000000 0.263 29.0
24 36.600000 0.254 51.0
25 31.100000 0.205 41.0
26 39.400000 0.257 43.0
27 23.200000 0.487 22.0
28 22.200000 0.245 57.0
29 34.100000 0.337 38.0
.. ... ... ...
738 36.600000 0.453 21.0
739 39.500000 0.293 42.0
740 42.300000 0.785 48.0
741 30.800000 0.400 26.0
742 28.500000 0.219 22.0
743 32.700000 0.734 45.0
744 40.600000 1.174 39.0
745 30.000000 0.488 46.0
746 49.300000 0.358 27.0
747 46.300000 1.096 32.0
748 36.400000 0.408 36.0
749 24.300000 0.178 50.0
750 31.200000 1.182 22.0
751 39.000000 0.261 28.0
752 26.000000 0.223 25.0
753 43.300000 0.222 26.0
754 32.400000 0.443 45.0
755 36.500000 1.057 37.0
756 32.000000 0.391 39.0
757 36.300000 0.258 52.0
758 37.500000 0.197 26.0
759 NaN NaN NaN
760 28.400000 0.766 22.0
761 44.000000 0.403 43.0
762 22.500000 0.142 33.0
763 32.900000 0.171 63.0
764 36.800000 0.340 27.0
765 26.200000 0.245 30.0
766 30.100000 0.349 47.0
767 30.400000 0.315 23.0

[768 rows x 8 columns]

Pregnancies Glucose BloodPressure SkinThickness Insulin \
0 6.0 148.0 72.000000 35.00000 155.548223
1 1.0 85.0 66.000000 29.00000 155.548223
2 8.0 183.0 64.000000 29.15342 155.548223
3 1.0 89.0 66.000000 23.00000 94.000000
4 0.0 137.0 40.000000 35.00000 168.000000
5 5.0 116.0 74.000000 29.15342 155.548223
6 3.0 78.0 50.000000 32.00000 88.000000
7 10.0 115.0 72.405184 29.15342 155.548223
8 2.0 197.0 70.000000 45.00000 543.000000
9 8.0 125.0 96.000000 29.15342 155.548223
10 4.0 110.0 92.000000 29.15342 155.548223
11 10.0 168.0 74.000000 29.15342 155.548223
12 10.0 139.0 80.000000 29.15342 155.548223
13 1.0 189.0 60.000000 23.00000 846.000000
14 5.0 166.0 72.000000 19.00000 175.000000
15 7.0 100.0 72.405184 29.15342 155.548223
16 0.0 118.0 84.000000 47.00000 230.000000
17 7.0 107.0 74.000000 29.15342 155.548223
18 1.0 103.0 30.000000 38.00000 83.000000
19 1.0 115.0 70.000000 30.00000 96.000000
20 3.0 126.0 88.000000 41.00000 235.000000
21 8.0 99.0 84.000000 29.15342 155.548223
22 7.0 196.0 90.000000 29.15342 155.548223
23 9.0 119.0 80.000000 35.00000 155.548223
24 11.0 143.0 94.000000 33.00000 146.000000
25 10.0 125.0 70.000000 26.00000 115.000000
26 7.0 147.0 76.000000 29.15342 155.548223
27 1.0 97.0 66.000000 15.00000 140.000000
28 13.0 145.0 82.000000 19.00000 110.000000
29 5.0 117.0 92.000000 29.15342 155.548223
.. ... ... ... ... ...
738 2.0 99.0 60.000000 17.00000 160.000000
739 1.0 102.0 74.000000 29.15342 155.548223
740 11.0 120.0 80.000000 37.00000 150.000000
741 3.0 102.0 44.000000 20.00000 94.000000
742 1.0 109.0 58.000000 18.00000 116.000000
743 9.0 140.0 94.000000 29.15342 155.548223
744 13.0 153.0 88.000000 37.00000 140.000000
745 12.0 100.0 84.000000 33.00000 105.000000
746 1.0 147.0 94.000000 41.00000 155.548223
747 1.0 81.0 74.000000 41.00000 57.000000
748 3.0 187.0 70.000000 22.00000 200.000000
749 6.0 162.0 62.000000 29.15342 155.548223
750 4.0 136.0 70.000000 29.15342 155.548223
751 1.0 121.0 78.000000 39.00000 74.000000
752 3.0 108.0 62.000000 24.00000 155.548223
753 0.0 181.0 88.000000 44.00000 510.000000
754 8.0 154.0 78.000000 32.00000 155.548223
755 1.0 128.0 88.000000 39.00000 110.000000
756 7.0 137.0 90.000000 41.00000 155.548223
757 0.0 123.0 72.000000 29.15342 155.548223
758 1.0 106.0 76.000000 29.15342 155.548223
759 29.0 29.0 29.000000 29.00000 29.000000
760 2.0 88.0 58.000000 26.00000 16.000000
761 9.0 170.0 74.000000 31.00000 155.548223
762 9.0 89.0 62.000000 29.15342 155.548223
763 10.0 101.0 76.000000 48.00000 180.000000
764 2.0 122.0 70.000000 27.00000 155.548223
765 5.0 121.0 72.000000 23.00000 112.000000
766 1.0 126.0 60.000000 29.15342 155.548223
767 1.0 93.0 70.000000 31.00000 155.548223

BMI DiabetesPedigreeFunction Age

0 33.600000 0.627 50.0
1 26.600000 0.351 31.0
2 23.300000 0.672 32.0
3 28.100000 0.167 21.0
4 43.100000 2.288 33.0
5 25.600000 0.201 30.0
6 31.000000 0.248 26.0
7 35.300000 0.134 29.0
8 30.500000 0.158 53.0
9 32.457464 0.232 54.0
10 37.600000 0.191 30.0
11 38.000000 0.537 34.0
12 27.100000 1.441 57.0
13 30.100000 0.398 59.0
14 25.800000 0.587 51.0
15 30.000000 0.484 32.0
16 45.800000 0.551 31.0
17 29.600000 0.254 31.0
18 43.300000 0.183 33.0
19 34.600000 0.529 32.0
20 39.300000 0.704 27.0
21 35.400000 0.388 50.0
22 39.800000 0.451 41.0
23 29.000000 0.263 29.0
24 36.600000 0.254 51.0
25 31.100000 0.205 41.0
26 39.400000 0.257 43.0
27 23.200000 0.487 22.0
28 22.200000 0.245 57.0
29 34.100000 0.337 38.0
.. ... ... ...
738 36.600000 0.453 21.0
739 39.500000 0.293 42.0
740 42.300000 0.785 48.0
741 30.800000 0.400 26.0
742 28.500000 0.219 22.0
743 32.700000 0.734 45.0
744 40.600000 1.174 39.0
745 30.000000 0.488 46.0
746 49.300000 0.358 27.0
747 46.300000 1.096 32.0
748 36.400000 0.408 36.0
749 24.300000 0.178 50.0
750 31.200000 1.182 22.0
751 39.000000 0.261 28.0
752 26.000000 0.223 25.0
753 43.300000 0.222 26.0
754 32.400000 0.443 45.0
755 36.500000 1.057 37.0
756 32.000000 0.391 39.0
757 36.300000 0.258 52.0
758 37.500000 0.197 26.0
759 29.000000 29.000 29.0
760 28.400000 0.766 22.0
761 44.000000 0.403 43.0
762 22.500000 0.142 33.0
763 32.900000 0.171 63.0
764 36.800000 0.340 27.0
765 26.200000 0.245 30.0
766 30.100000 0.349 47.0
767 30.400000 0.315 23.0

[768 rows x 8 columns]

X.plot(kind='box',subplots=True,layout=(3,3),figsize=(15,10))
plt.show()

# Plotting all of your data: bee swarm plots

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(15,8))
_ = sns.swarmplot(data=pima_df)
plt.show()

plt.figure(figsize=(15, 6))
_ = sns.swarmplot(x='Age',y='Glucose',hue='Outcome',data=pima_df)
plt.show()

#!pip install plotnine

reference for exploring different smooths in ggplot2

import warnings
warnings.filterwarnings("ignore")
from plotnine import *
ggplot(pima_df,aes(x='Age',y='Glucose',colour='Outcome')) +geom_point()+stat_smooth()

<ggplot: (7547461505)>

(ggplot(pima_df,aes(x='Age',y='Glucose',colour = 'BloodPressure'))
 +geom_point()+stat_smooth()+facet_wrap('~Outcome'))

<ggplot: (294288713)>

(ggplot(pima_df,aes(x='Age', y='Pregnancies'))
 +geom_point(aes(color='BMI'))+facet_wrap('~Outcome')+stat_smooth())

<ggplot: (294281529)>

# correlation of each Point

corr = pima_df.loc[:,pima_df.columns!='Outcome'].corr()
plt.figure(figsize=(12,12))
sns.heatmap(corr,annot=True,cmap="Blues")

<matplotlib.axes._subplots.AxesSubplot at 0x1c1e1a6510>

We can observe correlations between some columns:
Age is highly correlated with Pregnancies
Insulin is correlated with Glucose
SkinThickness is correlated with BMI

sns.lmplot(x='Age', y = 'Pregnancies', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x1c1e19bbd0>

sns.lmplot(x='Insulin', y = 'Glucose', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x1c1e3ee9d0>

sns.lmplot(x='BMI', y = 'SkinThickness', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x10f8e4390>

# Visualise a pairplot using seaborn, which plots each attribute against
# every other attribute
sns.pairplot(pima_df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                      'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']])

<seaborn.axisgrid.PairGrid at 0x10fb49750>

4. Data Scaling
Many machine learning algorithms expect the scale of the input, and even the output, data to be equivalent.
Scaling can help in methods that weight inputs in order to make a prediction, such as linear regression and
logistic regression. It is practically required in methods that combine weighted inputs in complex ways, such
as artificial neural networks and deep learning.

We will discuss:

1. Normalize Data
2. Standardize Data
3. When to Normalize and Standardize

1. Normalize Data

Normalization can refer to different techniques depending on context. Here, we use normalization to refer
to rescaling an input variable to the range between 0 and 1. Normalization requires that you know the
minimum and maximum values for each attribute. These can be estimated from training data or specified
directly if you have deep knowledge of the problem domain. You can easily estimate the minimum and
maximum values for each attribute in a dataset by enumerating through the values.

Once we have estimates of the maximum and minimum allowed values for each column, we can normalize
the raw data to the range 0 to 1. The calculation to normalize a single value for a column is:
scaled value = (value - min)/(max - min)
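
The formula can be tried on a single column in plain Python. This is a minimal sketch on made-up glucose readings. `min_max_scale` is a hypothetical helper, and mapping a constant column to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0 by convention
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

glucose = [85, 89, 137, 148, 183]    # made-up values for illustration
scaled = min_max_scale(glucose)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.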

np.set_printoptions(precision=3)
array = np.array(pima_df.values)
print("== Generating data sets ==")

print("diabetes_attr: unchanged, original attributes")

diabetes_attr = array[:,0:8]
label = array[:,8] #unchanged across preprocessing?
diabetes_df = pd.DataFrame(diabetes_attr)

== Generating data sets ==

diabetes_attr: unchanged, original attributes

print("Normalized_attributes: range of 0 to 1")

from sklearn import preprocessing as preproc
scaler = preproc.MinMaxScaler().fit(diabetes_attr)
normalized_attr = scaler.transform(diabetes_attr)
normalized_df = pd.DataFrame(normalized_attr)
print(normalized_df.describe())

Normalized_attributes: range of 0 to 1
0 1 2 3 4 5 \
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 0.226180 0.501205 0.493930 0.240798 0.170130 0.291564
std 0.198210 0.196361 0.123432 0.095554 0.102189 0.140596
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.058824 0.359677 0.408163 0.195652 0.129207 0.190184
50% 0.176471 0.470968 0.491863 0.240798 0.170130 0.290389
75% 0.352941 0.620968 0.571429 0.271739 0.170130 0.376278
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

6 7
count 768.000000 768.000000
mean 0.168179 0.204015
std 0.141473 0.196004
min 0.000000 0.000000
25% 0.070773 0.050000
50% 0.125747 0.133333
75% 0.234095 0.333333
max 1.000000 1.000000

2. Standardize Data

Standardization is a rescaling technique that centers the distribution of the data on the value 0
and rescales its standard deviation to the value 1. Together, the mean and the standard deviation can be used to
summarize a normal distribution, also called the Gaussian distribution or bell curve. Standardization requires that the
mean and standard deviation of the values for each column be known prior to scaling. As with normalizing
above, we can estimate these values from training data, or use domain knowledge to specify their values.

The standard deviation describes the typical spread of values around the mean. It can be calculated by
summing the squared differences between each value and the mean, dividing by the number of values minus 1,
and taking the square root of the result.

Once the mean and standard deviation are calculated, we can easily calculate the standardized value. The calculation
to standardize a single value for a column is:
standardized value = (value - mean)/stdev
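
A minimal sketch of the calculation on made-up BMI values, using only the standard library. `standardize` is a hypothetical helper. It also shows the effect of the denominator: scikit-learn's `preprocessing.scale` divides by the population standard deviation (n), while pandas' `describe()` reports the sample standard deviation (n - 1), which is why a `describe()` summary of scaled data with 768 rows shows a std of 1.000652 (the square root of 768/767) rather than exactly 1.

```python
from statistics import mean, pstdev, stdev

def standardize(column, sample=True):
    """Return (value - mean) / stdev for every value in the column."""
    mu = mean(column)
    # stdev divides by n - 1 (sample); pstdev divides by n (population)
    sd = stdev(column) if sample else pstdev(column)
    return [(v - mu) / sd for v in column]

bmi = [23.3, 26.6, 28.1, 33.6, 43.1]    # made-up values for illustration
z_sample = standardize(bmi, sample=True)
z_population = standardize(bmi, sample=False)

# Scaled with the sample SD: mean ~0 and sample SD ~1
print(round(mean(z_sample), 12), round(stdev(z_sample), 12))
# Scaled with the population SD: the sample SD is sqrt(n / (n - 1)), not 1
print(round(stdev(z_population), 6))
```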

print("standardized_attr: mean of 0 and stdev of 1")

#scaler = preproc.StandardScaler().fit(diabetes_attr)
#standardized_attr = scaler.transform(diabetes_attr)
standardized_attr = preproc.scale(diabetes_attr)
standardized_df = pd.DataFrame(standardized_attr)
print(standardized_df.describe())

standardized_attr: mean of 0 and stdev of 1

0 1 2 3 4 \
count 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02
mean 2.544261e-17 -3.301757e-16 6.966722e-16 6.866252e-16 -2.352033e-16
std 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00
min -1.141852e+00 -2.554131e+00 -4.004245e+00 -2.521670e+00 -1.665945e+00
25% -8.448851e-01 -7.212214e-01 -6.953060e-01 -4.727737e-01 -4.007289e-01
50% -2.509521e-01 -1.540881e-01 -1.675912e-02 8.087936e-16 -3.345079e-16
75% 6.399473e-01 6.103090e-01 6.282695e-01 3.240194e-01 -3.345079e-16
max 3.906578e+00 2.541850e+00 4.102655e+00 7.950467e+00 8.126238e+00

5 6 7
count 7.680000e+02 7.680000e+02 7.680000e+02
mean 3.090699e-16 2.398978e-16 1.857600e-16
std 1.000652e+00 1.000652e+00 1.000652e+00
min -2.075119e+00 -1.189553e+00 -1.041549e+00
25% -7.215397e-01 -6.889685e-01 -7.862862e-01
50% -8.363615e-03 -3.001282e-01 -3.608474e-01
75% 6.029301e-01 4.662269e-01 6.602056e-01
max 5.042087e+00 5.883565e+00 4.063716e+00

3. When to Normalize and Standardize

Standardization is a scaling technique that assumes your data conforms to a normal distribution. If a given
data attribute is normal or close to normal, this is probably the scaling method to use. It is good practice to
record the summary statistics used in the standardization process so that you can apply them when
standardizing data in the future that you may want to use with your model. Normalization is a scaling
technique that does not assume any specific distribution.

If your data is not normally distributed, consider normalizing it prior to applying your machine learning
algorithm. It is good practice to record the minimum and maximum values for each column used in the
normalization process, again, in case you need to normalize new data in the future to be used with your
model.
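
Recording the statistics and reusing them on later data might look like the sketch below. The helper names and the ages are made up for illustration. The point is that the statistics are computed once from training data and then applied unchanged to new values.

```python
from statistics import mean, stdev

def fit_stats(column):
    """Record the summary statistics needed to rescale a column later."""
    return {"min": min(column), "max": max(column),
            "mean": mean(column), "stdev": stdev(column)}

def normalize(value, stats):
    return (value - stats["min"]) / (stats["max"] - stats["min"])

def standardize(value, stats):
    return (value - stats["mean"]) / stats["stdev"]

train_age = [21, 26, 29, 31, 33, 50]     # made-up training values
age_stats = fit_stats(train_age)         # computed once, from training data only

new_age = 60                             # a later, unseen value
print(normalize(new_age, age_stats))     # can fall outside [0, 1]
print(standardize(new_age, age_stats))
```

A new value beyond the recorded maximum normalizes to more than 1, which is one reason to specify the minimum and maximum from domain knowledge when they are known.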

Handling Imbalanced Class Data

print("=== undersampling majority class by purging ===")

# Separate majority and minority classes

df_majority = pima_df[pima_df['Outcome']==0]
df_minority = pima_df[pima_df['Outcome']==1]

=== undersampling majority class by purging ===

print("df_minority['class'].size", df_minority['Outcome'].size)
from sklearn.utils import resample
# Downsample majority class

df_majority_downsampled = resample(df_majority,
                                   replace=False,     # sample without replacement
                                   n_samples=df_minority['Outcome'].size,  # match minority class
                                   random_state=7)    # reproducible results

# Combine minority class with downsampled majority class

df_downsampled = pd.concat([df_majority_downsampled, df_minority])

("df_minority['class'].size", 268)

print("undersampled", df_downsampled.groupby('Outcome').size())
df_downsampled=df_downsampled.sample(frac=1).reset_index(drop=True)
undersampling_attr = np.array(df_downsampled.values[:,0:8])
undersampling_label = np.array(df_downsampled.values[:,8])
('undersampled', Outcome
0 268
1 268
dtype: int64)

#!pip install imblearn

print("=== oversampling minority class with SMOTE ===")

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=7)
x_val = pima_df.values[:,0:8]
y_val = pima_df.values[:,8]
X_res, y_res = sm.fit_resample(x_val, y_val)  # named fit_sample in older imbalanced-learn releases

features=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',

'DiabetesPedigreeFunction', 'Age']
oversampled_df = pd.DataFrame(X_res)
oversampled_df.columns = features
oversampled_df = oversampled_df.assign(label = np.asarray(y_res))
oversampled_df = oversampled_df.sample(frac=1).reset_index(drop=True)

oversampling_attr = oversampled_df.values[:,0:8]
oversampling_label = oversampled_df.values[:,8]
print("oversampled_df", oversampled_df.groupby('label').size())

=== oversampling minority class with SMOTE ===

('oversampled_df', label
0.0 500
1.0 500
dtype: int64)

print("== treating missing values by purging or imputating ==")

## missing.arff
print("=== Assuming, zero indicates missing values === ")
print("missing values by count")
print((pima_df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']] == 0).sum())
print("=== purging ===")
# make a copy of original data set
dataset_cp = pima_df.copy(deep=True)

# Treat zeros in these columns as missing by replacing them with NaN
cols_with_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
dataset_cp[cols_with_missing] = dataset_cp[cols_with_missing].replace(0, np.nan)
== treating missing values by purging or imputating ==
=== Assuming, zero indicates missing values ===
missing values by count
Pregnancies 111
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
dtype: int64
=== purging ===

# dataset with missing values

dataset_missing = dataset_cp.dropna()

# summarize the number of rows and columns in the dataset

print(dataset_cp.shape)

missing_attr = np.array(dataset_missing.values[:,0:8])
missing_label = np.array(dataset_missing.values[:,8])

print("=== imputing by replacing missing values with mean column values ===")

dataset_impute = dataset_cp.fillna(dataset_cp.mean())
# count the number of NaN values in each column
print(dataset_impute.isnull().sum())

print("== addressing class imbalance under or over sampling ==")

impute_attr = np.array(dataset_impute.values[:,0:8])

(768, 9)
=== imputing by replacing missing values with mean column values ===
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
== addressing class imbalance under or over sampling ==

Dimensionality reduction using PCA

Principal component analysis (PCA) is a technique that transforms a dataset of many features into
principal components that "summarize" the variance that underlies the data.

Each principal component is calculated by finding the linear combination of features that maximizes
variance, while also ensuring zero correlation with the previously calculated principal components.
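
One way to see both properties is to compute the components by hand from the covariance matrix. The sketch below uses NumPy on randomly generated data (an assumption for illustration, not the diabetes dataset) and takes the eigenvectors of the covariance matrix as the component directions.

```python
import numpy as np

rng = np.random.default_rng(7)
# Made-up data: 200 rows, 3 features, the second strongly tied to the first
x0 = rng.normal(size=200)
X = np.column_stack([x0, 2 * x0 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

Xc = X - X.mean(axis=0)                      # center each feature
cov = np.cov(Xc, rowvar=False)               # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: symmetric input, ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                        # data expressed in principal components
explained_ratio = eigvals / eigvals.sum()

print(explained_ratio)
print(np.round(np.cov(scores, rowvar=False), 6))   # off-diagonal entries are ~0
```

The covariance matrix of the scores is diagonal: each component's variance sits on the diagonal, and the zeros elsewhere are the "zero correlation" between components.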

Use cases for modeling:

One of the most common dimensionality reduction techniques

Use if there are too many features or if observation/feature ratio is poor
Also, potentially good option if there are a lot of highly correlated variables in your dataset

Unfortunately, PCA makes models a lot harder to interpret

PCA as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal
components, resulting in a lower-dimensional projection of the data that preserves the maximal data
variance.

Choosing the number of components

A vital part of using PCA in practice is the ability to estimate how many components are needed to describe
the data. This can be determined by looking at the cumulative explained variance ratio as a function of the
number of components:

# Use PCA from sklearn.decomposition to find principal components

from sklearn.decomposition import PCA
pca = PCA()
X_pca = pd.DataFrame(pca.fit_transform(pima_df))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
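
Instead of reading the number off the plot, the same cumulative curve can be thresholded. This is a sketch only: `components_for` is a hypothetical helper, the 95% threshold is an arbitrary assumption, and the ratios are made up. In practice `pca.explained_variance_ratio_` would be passed in.

```python
import numpy as np

def components_for(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # side="left" gives the first index where cumulative >= threshold
    idx = int(np.searchsorted(cumulative, threshold, side="left"))
    # Clamp in case rounding leaves the final cumulative sum just below the threshold
    return min(idx + 1, len(cumulative))

# Made-up ratios for illustration
ratios = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(components_for(ratios, 0.90))
print(components_for(ratios, 0.95))
```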

pca = PCA(n_components=5)
pca.fit(diabetes_attr)
diabetes_attr_pca = pca.transform(diabetes_attr)
print("original shape: ", diabetes_attr.shape)
print("transformed shape:", diabetes_attr_pca.shape)

('original shape: ', (768, 8))

('transformed shape:', (768, 5))

pca.fit(normalized_attr)
normalized_attr_pca = pca.transform(normalized_attr)

pca.fit(standardized_attr)
standardized_attr_pca = pca.transform(standardized_attr)

pca.fit(impute_attr)
impute_attr_pca = pca.transform(impute_attr)

pca.fit(missing_attr)
missing_attr_pca = pca.transform(missing_attr)

pca.fit(undersampling_attr)
undersampling_attr_pca = pca.transform(undersampling_attr)

pca.fit(oversampling_attr)
oversampling_attr_pca = pca.transform(oversampling_attr)

Evaluate Algorithms
print(" == Evaluate Some Algorithms == ")
# Split-out validation dataset
print(" == Create a Validation Dataset: Split-out validation dataset == ")

# Test options and evaluation metric

print(" == Test Harness: Test options and evaluation metric == ")
seed = 7
scoring = 'accuracy'

== Evaluate Some Algorithms ==

== Create a Validation Dataset: Split-out validation dataset ==
== Test Harness: Test options and evaluation metric ==

# Spot Check Algorithms without feature reduction

# algo eval imports
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# significance tests
import scipy.stats as stats
import math

print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr, label))
datasets.append(('normalized_attr', normalized_attr, label))
datasets.append(('standardized_attr', standardized_attr, label))
datasets.append(('impute_attr', impute_attr, label))
datasets.append(('missing_attr', missing_attr, missing_label))
datasets.append(('undersampling_attr', undersampling_attr, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr, oversampling_label))

models = []
models.append(('LR', LogisticRegression())) # based on imbalanced datasets and default parameters
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVM', SVC()))

print("eval metric: " + scoring)

for dataname, attributes, target in datasets:
    # evaluate each model in turn
    results = []
    names = []
    print("= " + dataname + " = ")
    print("algorithm,mean,std,signficance,p-val")
    for name, model in models:
        # No shuffling, so the folds are deterministic; recent scikit-learn
        # versions reject random_state unless shuffle=True
        kfold = model_selection.KFold(n_splits=10)
        cv_results = model_selection.cross_val_score(model, attributes, target,
                                                     cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)

        # Paired t-test of this model's fold scores against the first model (LR)
        t, prob = stats.ttest_rel(a=cv_results, b=results[0])

        # Below 0.05, significant. Over 0.05, not significant.
        # http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005
        statistically_different = (prob < 0.05)

        msg = "%s: %f (%f) %s %f" % (name, cv_results.mean(), cv_results.std(),
                                     statistically_different, prob)
        print(msg)

    # Compare Algorithms
    print(" == Select Best Model, Compare Algorithms == ")
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison for ' + dataname)
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    plt.ylabel(scoring)
    ax.set_xticklabels(names)
    plt.show()

== Build Models: build and evaluate models, Spot Check Algorithms ==

eval metric: accuracy
= diabetes_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765636 (0.047532) False nan
LDA: 0.766951 (0.052975) False 0.820491
KNN: 0.713534 (0.064980) True 0.012597
CART: 0.687474 (0.063816) True 0.002141
NB: 0.747386 (0.043583) False 0.203854
RF: 0.744737 (0.064272) False 0.184573
SVM: 0.651025 (0.072141) True 0.000537
== Select Best Model, Compare Algorithms ==

= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765619 (0.046566) False nan
LDA: 0.766951 (0.052975) False 0.828238
KNN: 0.748701 (0.062006) False 0.235048
CART: 0.700496 (0.048400) True 0.001043
NB: 0.747386 (0.043583) False 0.132240
RF: 0.746036 (0.058189) False 0.061152
SVM: 0.770813 (0.052488) False 0.309233
== Select Best Model, Compare Algorithms ==

= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.770813 (0.051248) False nan
LDA: 0.766951 (0.052975) False 0.526999
KNN: 0.738278 (0.039157) True 0.030019
CART: 0.687440 (0.063132) True 0.001104
NB: 0.747386 (0.043583) False 0.061474
RF: 0.757707 (0.060612) False 0.356660
SVM: 0.753913 (0.044789) True 0.022590
== Select Best Model, Compare Algorithms ==

= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764320 (0.048484) False nan
LDA: 0.766951 (0.052975) False 0.675096
KNN: 0.713534 (0.064980) True 0.014497
CART: 0.696617 (0.055419) True 0.000589
NB: 0.747386 (0.043583) False 0.243960
RF: 0.764234 (0.057085) False 0.996000
SVM: 0.651025 (0.072141) True 0.000669
== Select Best Model, Compare Algorithms ==

= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764337 (0.047320) False nan
LDA: 0.766951 (0.052975) False 0.640480
KNN: 0.713534 (0.064980) True 0.014492
CART: 0.691353 (0.063152) True 0.002221
NB: 0.747386 (0.043583) False 0.220344
RF: 0.734330 (0.062398) True 0.037339
SVM: 0.651025 (0.072141) True 0.000570
== Select Best Model, Compare Algorithms ==

= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.749895 (0.053692) False nan
LDA: 0.751747 (0.071419) False 0.897521
KNN: 0.694165 (0.071292) True 0.009508
CART: 0.663941 (0.081059) True 0.016613
NB: 0.710936 (0.081741) False 0.058574
RF: 0.720335 (0.059445) False 0.210477
SVM: 0.458910 (0.072346) True 0.000001
== Select Best Model, Compare Algorithms ==

= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.755000 (0.051039) False nan
LDA: 0.749000 (0.045486) False 0.111373
KNN: 0.768000 (0.027129) False 0.481468
CART: 0.762000 (0.028213) False 0.677050
NB: 0.713000 (0.046054) True 0.001323
RF: 0.814000 (0.045869) True 0.002612
SVM: 0.713000 (0.043829) False 0.090454
== Select Best Model, Compare Algorithms ==

# Spot Check Algorithms after feature reduction with pca

# significance tests
import scipy.stats as stats
import math

print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr_pca, label))
datasets.append(('normalized_attr', normalized_attr_pca, label))
datasets.append(('standardized_attr', standardized_attr_pca, label))
datasets.append(('impute_attr', impute_attr_pca, label))  # the recorded run below passed impute_attr, so its impute_attr results are not PCA-reduced
datasets.append(('missing_attr', missing_attr_pca, missing_label))
datasets.append(('undersampling_attr', undersampling_attr_pca, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr_pca, oversampling_label))

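print("eval metric: " + scoring)

# Evaluation loop, assumed to mirror the one above, over the PCA-reduced datasets
for dataname, attributes, target in datasets:
    results = []
    print("= " + dataname + " = ")
    print("algorithm,mean,std,signficance,p-val")
    for name, model in models:
        kfold = model_selection.KFold(n_splits=10)
        cv_results = model_selection.cross_val_score(model, attributes, target,
                                                     cv=kfold, scoring=scoring)
        results.append(cv_results)
        t, prob = stats.ttest_rel(a=cv_results, b=results[0])
        # Below 0.05, significant. Over 0.05, not significant.
        statistically_different = (prob < 0.05)
        msg = "%s: %f (%f) %s %f" % (name, cv_results.mean(), cv_results.std(),
                                     statistically_different, prob)
        print(msg)
    print(" == Select Best Model, Compare Algorithms == ")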

== Build Models: build and evaluate models, Spot Check Algorithms ==

eval metric: accuracy
= diabetes_attr =
algorithm,mean,std,signficance,p-val
LR: 0.760492 (0.049736) False nan
LDA: 0.753947 (0.051575) False 0.297575
KNN: 0.717413 (0.066119) True 0.023485
CART: 0.670540 (0.066640) True 0.001672
NB: 0.747454 (0.048375) False 0.195421
RF: 0.713551 (0.053480) True 0.009030
SVM: 0.651025 (0.072141) True 0.000815
== Select Best Model, Compare Algorithms ==

= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.763004 (0.052644) False nan
LDA: 0.759091 (0.049164) False 0.432778
KNN: 0.720010 (0.064269) True 0.001669
CART: 0.654802 (0.061094) True 0.000207
NB: 0.742208 (0.045907) False 0.140785
RF: 0.740858 (0.063580) False 0.123441
SVM: 0.772095 (0.054777) True 0.009535
== Select Best Model, Compare Algorithms ==

= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.748701 (0.033960) False nan
LDA: 0.742208 (0.031646) False 0.272912
KNN: 0.718763 (0.051160) True 0.033641
CART: 0.704323 (0.043981) True 0.007545
NB: 0.721343 (0.035560) True 0.035025
RF: 0.716131 (0.047187) True 0.008683
SVM: 0.733083 (0.046566) False 0.179971
== Select Best Model, Compare Algorithms ==

= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765636 (0.047532) False nan
LDA: 0.766951 (0.052975) False 0.820491
KNN: 0.713534 (0.064980) True 0.012597
CART: 0.687389 (0.049055) True 0.000286
NB: 0.747386 (0.043583) False 0.203854
RF: 0.751299 (0.053382) False 0.169036
SVM: 0.651025 (0.072141) True 0.000537
== Select Best Model, Compare Algorithms ==

= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.760492 (0.049736) False nan
LDA: 0.753947 (0.051575) False 0.297575
KNN: 0.717413 (0.066119) True 0.023485
CART: 0.663995 (0.055143) True 0.000938
NB: 0.747454 (0.048375) False 0.195421
RF: 0.731716 (0.058688) False 0.065855
SVM: 0.651025 (0.072141) True 0.000815
== Select Best Model, Compare Algorithms ==

= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.716387 (0.058158) False nan
LDA: 0.716282 (0.060502) False 0.983147
KNN: 0.690426 (0.067903) False 0.208882
CART: 0.673620 (0.065603) False 0.115618
NB: 0.701398 (0.058597) False 0.206335
RF: 0.686513 (0.074474) False 0.179856
SVM: 0.451468 (0.063015) True 0.000002
== Select Best Model, Compare Algorithms ==

= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.711000 (0.038588) False nan
LDA: 0.717000 (0.040262) False 0.051003
KNN: 0.762000 (0.023580) True 0.000407
CART: 0.742000 (0.049960) False 0.135066
NB: 0.708000 (0.050951) False 0.802536
RF: 0.767000 (0.024920) True 0.001139
SVM: 0.684000 (0.052192) False 0.261975
== Select Best Model, Compare Algorithms ==
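
The significance column in the tables above comes from a paired t-test of each model's per-fold scores against the LR scores. As a minimal sketch of that single step, the snippet below runs the same `stats.ttest_rel` call on two made-up score arrays. The fold accuracies and the helper name `compare_to_baseline` are hypothetical and are not taken from the notebook; only the `ttest_rel` call and the 0.05 threshold come from the code above.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical per-fold accuracies from the same 10 cross-validation folds.
# These numbers are invented for illustration only.
baseline_scores = np.array([0.78, 0.74, 0.80, 0.71, 0.77,
                            0.75, 0.79, 0.73, 0.76, 0.78])
candidate_scores = np.array([0.70, 0.69, 0.74, 0.66, 0.72,
                             0.68, 0.73, 0.65, 0.71, 0.70])


def compare_to_baseline(cv_results, baseline, alpha=0.05):
    """Paired t-test of one model's fold scores against the baseline's.

    Returns the t statistic, the p-value, and whether p < alpha.
    """
    t, prob = stats.ttest_rel(a=cv_results, b=baseline)
    return t, prob, bool(prob < alpha)


# A candidate that is consistently worse on every fold.
t, prob, different = compare_to_baseline(candidate_scores, baseline_scores)
print("candidate: %f (%f) %s %f" % (candidate_scores.mean(),
                                    candidate_scores.std(), different, prob))

# The baseline compared with itself: every paired difference is zero, so the
# t statistic is 0/0 and both t and p come back as nan. A nan p-value is never
# below 0.05, which is why the LR rows above read "False nan".
_, self_prob, self_different = compare_to_baseline(baseline_scores,
                                                   baseline_scores)
print("baseline: %f (%f) %s %f" % (baseline_scores.mean(),
                                   baseline_scores.std(), self_different,
                                   self_prob))
```

The test is paired because both score arrays must come from the same folds in the same order. If the folds differ between models (for example, a shuffled splitter without a fixed seed), the pairing no longer holds and an unpaired test would be the safer choice.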

