DSBDA GRP A Print
Out[3]:          Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am  ...
0  2008-12-01   Albury     13.4     22.9       0.6          NaN       NaN           W           44.0          W  ...
1  2008-12-02   Albury      7.4     25.1       0.0          NaN       NaN         WNW           44.0        NNW  ...
2  2008-12-03   Albury     12.9     25.7       0.0          NaN       NaN         WSW           46.0          W  ...
3  2008-12-04   Albury      9.2     28.0       0.0          NaN       NaN          NE           24.0         SE  ...
4  2008-12-05   Albury     17.5     32.3       1.0          NaN       NaN           W           41.0        ENE  ...
5  2008-12-06   Albury     14.6     29.7       0.2          NaN       NaN         WNW           56.0          W  ...
6  2008-12-07   Albury     14.3     25.0       0.0          NaN       NaN           W           50.0         SW  ...
7  2008-12-08   Albury      7.7     26.7       0.0          NaN       NaN           W           35.0        SSE  ...
8  2008-12-09   Albury      9.7     31.9       0.0          NaN       NaN         NNW           80.0         SE  ...
9  2008-12-10   Albury     13.1     30.1       1.4          NaN       NaN           W           28.0          S  ...
10 rows × 24 columns
Out[4]:              Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am  ...
145455  2017-06-21    Uluru      2.8     23.4       0.0          NaN       NaN           E           31.0         SE  ...
145456  2017-06-22    Uluru      3.6     25.3       0.0          NaN       NaN         NNW           22.0         SE  ...
145457  2017-06-23    Uluru      5.4     26.9       0.0          NaN       NaN           N           37.0         SE  ...
145458  2017-06-24    Uluru      7.8     27.0       0.0          NaN       NaN          SE           28.0        SSE  ...
145459  2017-06-25    Uluru     14.9      NaN       0.0          NaN       NaN         NaN            NaN        ESE  ...
5 rows × 24 columns
Data Preprocessing
In [5]: data.info()   # display information about the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 24 columns):
# Column Non-Null Count Dtype
In [8]: print(data.describe(include=[object]))
RainTomorrow
count 142193
unique 2
top No
freq 110316
In [11]: data.isnull().sum()
Out[11]: Date 0
Location 0
MinTemp 1485
MaxTemp 1261
Rainfall 3261
Evaporation 62790
Sunshine 69835
WindGustDir 10326
WindGustSpeed 10263
WindDir9am 10566
WindDir3pm 4228
WindSpeed9am 1767
WindSpeed3pm 3062
Humidity9am 2654
Humidity3pm 4507
Pressure9am 15065
Pressure3pm 15028
Cloud9am 55888
Cloud3pm 59358
Temp9am 1767
Temp3pm 3609
RainToday 3261
RISK_MM 3267
RainTomorrow 3267
dtype: int64
In [12]: data.shape
In [15]: data['RainTomorrow'].value_counts()
Out[15]: No     110316
Yes     31877
Name: RainTomorrow, dtype: int64
In [16]: plt.figure(figsize=(17,15))
corre = sns.heatmap(data.corr(), square=True, annot=True)
corre.set_xticklabels(corre.get_xticklabels(), rotation=90)
plt.show()
Handling Missing values in Categorical Features:
In [17]: categorical_features = [column_name for column_name in data.columns if data[column_name].dtype == 'O']
data[categorical_features].isnull().sum()
Out[17]: Date               0
Location 0
WindGustDir 10326
WindDir9am 10566
WindDir3pm 4228
RainToday 3261
RainTomorrow 3267
dtype: int64
In [18]: # Imputing the missing values in categorical features using the most frequent value, which is the mode:
categorical_features_with_null = [feature for feature in categorical_features if data[feature].isnull().sum() > 0]
for each_feature in categorical_features_with_null:
    mode_val = data[each_feature].mode()[0]
    data[each_feature].fillna(mode_val, inplace=True)
In [19]: numerical_features = [column_name for column_name in data.columns if data[column_name].dtype != 'O']
In [20]: data[numerical_features].isnull().sum().plot.bar()
Now that the numerical features are free from outliers, let's impute their missing values using the mean:
In [22]: numerical_features_with_null = [feature for feature in numerical_features if data[feature].isnull().sum() > 0]
for feature in numerical_features_with_null:
    mean_value = data[feature].mean()
    data[feature].fillna(mean_value, inplace=True)
In [23]: data.isnull().sum()
Out[23]: Date 0
Location 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 0
WindGustSpeed 0
WindDir9am 0
WindDir3pm 0
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm            0
RainToday          0
RISK_MM            0
RainTomorrow       0
dtype: int64
Encoding
In [24]: def encode_data(feature_name):
    '''
    This function takes a feature name as a parameter and returns a mapping dictionary
    to replace (or map) the categorical values with numerical ones.
    '''
    mapping_dict = {}
    unique_values = list(data[feature_name].unique())
    for idx in range(len(unique_values)):
        mapping_dict[unique_values[idx]] = idx
    return mapping_dict
data['WindGustDir'].replace(encode_data('WindGustDir'),inplace = True)
data['WindDir9am'].replace(encode_data('WindDir9am'),inplace = True)
data['WindDir3pm'].replace(encode_data('WindDir3pm'),inplace = True)
In [26]: encode_data('RainToday')
data['RainTomorrow']
data['WindGustDir']
data['WindDir9am']
data['WindDir3pm']
data['Location']
Out[26]: 0          0
1          0
2          0
3          0
4          0
          ..
145455    48
145456    48
145457    48
145458    48
145459    48
Name: Location, Length: 145460, dtype: int64
In [27]: data['RainToday'].replace({'No':0, 'Yes': 1}, inplace = True)
data['WindGustDir'].replace(encode_data('WindGustDir'),inplace = True)
data['WindDir9am'].replace(encode_data('WindDir9am'),inplace = True)
data['WindDir3pm'].replace(encode_data('WindDir3pm'),inplace = True)
In [ ]: import pandas as pd
import seaborn as sns
In [ ]: data = pd.read_csv('/content/xAPI-Edu-Data.csv')
data.head()
Out[ ]: gender NationalITy PlaceofBirth StageID GradeID SectionID Topic Semester Relation raisedhands VisITedResources
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
# Column Non-Null Count Dtype
In [ ]: data[['raisedhands', 'VisITedResources', 'AnnouncementsView']]
Out[ ]:      raisedhands  VisITedResources  AnnouncementsView
0             15                16                  2
1             20                20                  3
2             10                 7                  0
3             30                25                  5
4             40                50                 12
...          ...               ...                ...
475            5                 4                  5
476           50                77                 14
477           55                74                 25
478           30                17                 14
479           35                14                 23
In [ ]: sns.boxplot(data['raisedhands'])
Out[ ]: <AxesSubplot:xlabel='raisedhands'>
In [ ]: sns.boxplot(data['VisITedResources'])
Out[ ]: <AxesSubplot:xlabel='VisITedResources'>
In [ ]: sns.boxplot(data['AnnouncementsView'])
Out[ ]: <AxesSubplot:xlabel='AnnouncementsView'>
Data Description:
Link of the dataset:
https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
About dataset:
The growth of supermarkets in highly populated cities is increasing, and market competition is
also high. The dataset contains historical sales of a supermarket company, recorded in 3
different branches over 3 months.
Attribute information:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [6]: # Read the datasets
sales = pd.read_csv('supermarket_sales.csv')
In [7]: sales.head()
Out[7]:     Invoice ID Branch       City Customer type  Gender            Product line  Unit price  Quantity   Tax 5%     Total       Date
0  750-67-8428      A     Yangon        Member  Female       Health and beauty       74.69         7  26.1415  548.9715   1/5/2019
1  226-31-3081      C  Naypyitaw        Normal  Female  Electronic accessories       15.28         5   3.8200   80.2200   3/8/2019
2  631-41-3108      A     Yangon        Normal    Male      Home and lifestyle       46.33         7  16.2155  340.5255   3/3/2019
3  123-19-1176      A     Yangon        Member    Male       Health and beauty       58.22         8  23.2880  489.0480  1/27/2019
4  373-73-7910      A     Yangon        Normal    Male       Sports and travel       86.31         7  30.2085  634.3785   2/8/2019
In [8]: sales.tail()
Out[8]:       Invoice ID Branch       City Customer type  Gender         Product line  Unit price  Quantity   Tax 5%      Total       Date
995  233-67-5758      C  Naypyitaw        Normal    Male    Health and beauty       40.35         1   2.0175    42.3675  1/29/2019
996  303-96-2227      B   Mandalay        Normal  Female   Home and lifestyle       97.38        10  48.6900  1022.4900   3/2/2019
997  727-02-1313      A     Yangon        Member    Male   Food and beverages       31.84         1   1.5920    33.4320   2/9/2019
998  347-56-2442      A     Yangon        Normal    Male   Home and lifestyle       65.82         1   3.2910    69.1110  2/22/2019
999  849-09-3807      A     Yangon        Member  Female  Fashion accessories       88.34         7  30.9190   649.2990  2/18/2019
In [9]: sales.shape
Out[9]: (1000, 17)
In [10]: sales.isnull().sum()
Out[10]: Invoice ID                 0
Branch 0
City 0
Customer type 0
Gender 0
Product line 0
Unit price 0
Quantity 0
Tax 5% 0
Total 0
Date 0
Time 0
Payment 0
cogs 0
gross margin percentage 0
gross income 0
Rating 0
dtype: int64
In [11]: sales.count()
Out[11]: Invoice ID                 1000
Branch 1000
City 1000
Customer type 1000
Gender 1000
Product line 1000
Unit price 1000
Quantity 1000
Tax 5% 1000
Total 1000
Date 1000
Time 1000
Payment 1000
cogs 1000
gross margin percentage 1000
gross income 1000
Rating 1000
dtype: int64
In [12]: sales.sum()
Out[12]: Invoice ID                 750-67-8428226-31-3081631-41-3108123-19-117637...
Branch                     ACAAACACABBBAAABAAABCBBAAABABABBBACCAACBBCBCCB...
City                       YangonNaypyitawYangonYangonYangonNaypyitawYang...
Customer type              MemberNormalNormalMemberNormalNormalMemberNorm...
Gender                     FemaleFemaleMaleMaleMaleMaleFemaleFemaleFemale...
Product line               Health and beautyElectronic accessoriesHome an...
Unit price                 55672.13
Quantity                   5510
Tax 5%                     15379.369
Total                      322966.749
Date                       1/5/20193/8/20193/3/20191/27/20192/8/20193/25/...
Time                       13:0810:2913:2320:3310:3718:3014:3611:3817:151...
Payment                    EwalletCashCredit cardEwalletEwalletEwalletEwa...
cogs                       307587.38
gross margin percentage    4761.904762
gross income               15379.369
Rating                     6972.7
dtype: object
In [13]: sales['Total'].sum()
Out[13]: 322966.749
In [14]: sales[['Total','gross income']].sum()
In [15]: # What is the average satisfaction level, and what is its spread? -- using mean() and std()
sales['Rating'].mean()
Out[15]: 6.972700000000003
In [16]: sales['Rating'].std()
Out[16]: 1.718580294379123
In [21]: # Minimum and maximum quantity sold -- using min() and max()
In [22]: sales['Quantity'].max()
Out[22]: 10
In [23]: # Daily increasing sales (how sales accumulate on a daily basis) -- using cumsum()
In [24]: sales['Total'].head()
Out[24]: 0    548.9715
1     80.2200
2    340.5255
3    489.0480
4    634.3785
Name: Total, dtype: float64
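The cumulative-sum call itself is not shown in this print; a minimal sketch of the running daily total the comment above describes:
In [ ]: sales['Total'].cumsum().head()   # running total of sales, row by row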
In [25]: # Median
median = sales['Quantity'].median()
median
Out[25]: 5.0
In [27]: # Variance
var = sales['Quantity'].var()
var
Out[27]: 8.546446446446451
In [29]: # Group by
groupby_sum = sales.groupby(['City']).sum()
groupby_sum
Out[29]:        Unit price  Quantity  Tax 5%  Total  cogs  gross margin percentage  gross income  Rating
City
In [30]: groupby_count = sales.groupby(['City']).count()
groupby_count
Out[30]:            Invoice ID  Branch  Customer type  Gender  Product line  Unit price  Quantity  Tax 5%  Total  Date  Time  Payment  ...
City
Mandalay          332     332            332     332           332         332       332     332    332   332   332      332  ...
Naypyitaw         328     328            328     328           328         328       328     328    328   328   328      328  ...
Yangon            340     340            340     340           340         340       340     340    340   340   340      340  ...
In [31]: groupby_sum1 = sales.groupby(['Payment']).sum()
groupby_sum1
Out[31]:
Payment
In [32]: groupby_count1 = sales.groupby(['Payment']).count()
groupby_count1
Out[32]:              Invoice ID  Branch  City  Customer type  Gender  Product line  Unit price  Quantity  Tax 5%  Total  Date  Time  cogs  ...
Payment
Cash                344     344   344            344     344           344         344       344     344    344   344   344   344  ...
Credit card         311     311   311            311     311           311         311       311     311    311   311   311   311  ...
Ewallet             345     345   345            345     345           345         345       345     345    345   345   345   345  ...
In [35]: # describe
In [36]: sales.describe(include='object')
Out[36]:        Invoice ID Branch    City Customer type  Gender         Product line      Date   Time  Payment
count         1000   1000    1000          1000    1000                 1000      1000   1000     1000
top    750-67-8428      A  Yangon        Member  Female  Fashion accessories  2/7/2019  19:48  Ewallet
In [37]: sales.describe(include='all')
Out[37]:          Invoice ID  Branch    City Customer type  Gender         Product line   Unit price     Quantity       Tax 5%  ...
count           1000    1000    1000          1000    1000                 1000  1000.000000  1000.000000  1000.000000  ...
unique          1000       3       3             2       2                    6          NaN          NaN          NaN  ...
top      750-67-8428       A  Yangon        Member  Female  Fashion accessories          NaN          NaN          NaN  ...
mean             NaN     NaN     NaN           NaN     NaN                  NaN    55.672130     5.510000    15.379369  ...
std              NaN     NaN     NaN           NaN     NaN                  NaN    26.494628     2.923431    11.708825  ...
min              NaN     NaN     NaN           NaN     NaN                  NaN    10.080000     1.000000     0.508500  ...
25%              NaN     NaN     NaN           NaN     NaN                  NaN    32.875000     3.000000     5.924875  ...
50%              NaN     NaN     NaN           NaN     NaN                  NaN    55.230000     5.000000    12.088000  ...
75%              NaN     NaN     NaN           NaN     NaN                  NaN    77.935000     8.000000    22.445250  ...
max              NaN     NaN     NaN           NaN     NaN                  NaN    99.960000    10.000000    49.650000  ...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
iris = pd.read_csv('iris.csv')   # loading cell not shown in the print; filename as used later in this manual
In [40]: iris.head()
Out[40]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
In [41]: iris.tail()
In [42]: iris.shape
Out[42]: (150, 6)
In [44]: iris.isnull().sum()
Out[44]: Id               0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
In [45]: iris.describe()
In [46]: # How many data points (flowers) are present for each species?
iris['Species'].value_counts()
Out[46]: Iris-setosa        50
Iris-versicolor 50
Iris-virginica 50
Name: Species, dtype: int64
In [47]: iris.plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm')
plt.show()
In [48]: iris.mean()
Out[48]: Id               75.500000
SepalLengthCm 5.843333
SepalWidthCm 3.054000
PetalLengthCm 3.758667
PetalWidthCm 1.198667
dtype: float64
iris.Species.mode()
0 Iris-setosa
1 Iris-versicolor
2 Iris-virginica
Name: Species, dtype: object
In [52]: setosa_data = iris[iris['Species'] == 'Iris-setosa']
setosa_data
Out[52]:     Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
...
38  39            4.4           3.0            1.3           0.2  Iris-setosa
39  40            5.1           3.4            1.5           0.2  Iris-setosa
...
In [55]: versicolor_data = iris[iris['Species'] == 'Iris-versicolor']
versicolor_stats = {
    'percentile': np.percentile(versicolor_data['SepalLengthCm'], 50),
    'mean': np.mean(versicolor_data['SepalLengthCm']),
    'std_dev': np.std(versicolor_data['SepalLengthCm'])
}
Problem Statement:
The objective is to predict the prices of houses using the given features.
The dataset used in this project comes from the Kaggle website.
This data was collected in 1978, and each of the 506 entries represents aggregate information about 14 features of
homes located in Boston.
Attribute information:
Linear Regression
Linear Regression is one of the most fundamental and widely known Machine Learning algorithms.
A Linear Regression model predicts the dependent variable using a regression line based on the independent variables.
The equation of Linear Regression is:
Y = m*X + C + e
where m is the slope of the line,
C is the intercept,
and e is the error term.
The equation above is used to predict the value of the target variable based on the given predictor variable(s).
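As a quick, hypothetical illustration of this equation (toy numbers, not the assignment data), a least-squares fit recovers the slope and intercept:
In [ ]: import numpy as np
X = np.array([1, 2, 3, 4, 5])
Y = 3 * X + 2                 # Y = m*X + C with m = 3, C = 2 and no error term
m, C = np.polyfit(X, Y, 1)    # least-squares estimates: m ≈ 3.0, C ≈ 2.0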
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]: df = pd.read_csv('HousingData.csv')
In [3]: df.head()
Out[3]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 NaN 36.2
In [4]: df.tail()
Out[4]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 NaN 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 NaN 2.5050 1 273 21.0 396.90 7.88 11.9
In [5]: df.describe()
Out[5]:        CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX  ...
count    486.000000  486.000000  486.000000  486.000000  506.000000  506.000000  486.000000  506.000000  506.000000  506.000000  ...
mean       3.611874   11.211934   11.083992    0.069959    0.554695    6.284634   68.518519    3.795043    9.549407  408.237154  ...
std        8.720192   23.388876    6.835896    0.255340    0.115878    0.702617   27.999513    2.105710    8.707259  168.537116  ...
min        0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000  ...
25%        0.081900    0.000000    5.190000    0.000000    0.449000    5.885500   45.175000    2.100175    4.000000  279.000000  ...
50%        0.253715    0.000000    9.690000    0.000000    0.538000    6.208500   76.800000    3.207450    5.000000  330.000000  ...
75%        3.560263   12.500000   18.100000    0.000000    0.624000    6.623500   93.975000    5.188425   24.000000  666.000000  ...
max       88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000  ...
In [6]: df.shape
In [7]: df.dtypes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
In [9]: df.isna().sum()
Out[9]: CRIM       20
ZN         20
INDUS 20
CHAS 20
NOX 0
RM 0
AGE 20
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 20
MEDV 0
dtype: int64
In [10]: df = df.fillna(df.mean())   # impute missing values with the column means (described below)
df.isnull().sum()
Out[10]: CRIM       0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
The mean() method is used to calculate the mean of each numeric column in the dataset.
The fillna() method is then used to replace all missing values in the dataframe with the mean values for their respective
columns.
The isnull() method is used to check if there are any remaining missing values in the dataframe.
In [13]: target_feature = 'MEDV'
x = df.drop(target_feature, axis=1)
y = df[target_feature]
In [14]: x.head()
Out[14]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.980000
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.140000
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.030000
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.940000
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 12.715432
In [15]: y.head()
Out[15]: 0    24.0
1        21.6
2 34.7
3 33.4
4 36.2
Name: MEDV, dtype: float64
Use model_selection.train_test_split from sklearn to split the data into training and
testing sets. Set test_size=0.2 and random_state=0
In [16]: from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
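The cells that create, fit, and apply the model are not shown in this print; a minimal sketch, assuming the regression and predictions names used in the cells below (cell numbers approximate):
In [17]: from sklearn.linear_model import LinearRegression
regression = LinearRegression()
In [18]: regression.fit(x_train, y_train)            # fit() echoes the estimator, hence the Out below
In [21]: predictions = regression.predict(x_test)    # predict MEDV for the test split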
Out[18]: LinearRegression()
In [22]: predictions
Create a scatterplot of the real test values versus the predicted values
In [23]: plt.scatter(y_test, predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
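The scoring cell is not shown; a sketch that would produce the figure below, assuming the notebook reported the R² score as a percentage rounded to two places:
In [25]: from sklearn.metrics import r2_score
round(r2_score(y_test, predictions) * 100, 2)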
Out[25]: 57.03
This means the model's R² score is fairly low on the test data.
In [26]: from sklearn import metrics
print('Mean Absolute Error on test data of Linear Regression: ', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error on test data of Linear Regression: ', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error on test data of Linear Regression: ', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
In [28]: df.head(15)
Out[28]: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.000000 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.980000 24.0
1 0.02731 0.0 7.07 0.000000 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.140000 21.6
2 0.02729 0.0 7.07 0.000000 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.030000 34.7
3 0.03237 0.0 2.18 0.000000 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.940000 33.4
4 0.06905 0.0 2.18 0.000000 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 12.715432 36.2
5 0.02985 0.0 2.18 0.000000 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.210000 28.7
6 0.08829 12.5 7.87 0.069959 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.430000 22.9
7 0.14455 12.5 7.87 0.000000 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.150000 27.1
8 0.21124 12.5 7.87 0.000000 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.930000 16.5
9 0.17004 12.5 7.87 0.069959 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.100000 18.9
10 0.22489 12.5 7.87 0.000000 0.524 6.377 94.3 6.3467 5 311 15.2 392.52 20.450000 15.0
11 0.11747 12.5 7.87 0.000000 0.524 6.009 82.9 6.2267 5 311 15.2 396.90 13.270000 18.9
12 0.09378 12.5 7.87 0.000000 0.524 5.889 39.0 5.4509 5 311 15.2 390.50 15.710000 21.7
13 0.62976 0.0 8.14 0.000000 0.538 5.949 61.8 4.7075 4 307 21.0 396.90 8.260000 20.4
14 0.63796 0.0 8.14 0.069959 0.538 6.096 84.5 4.4619 4 307 21.0 380.02 10.260000 18.2
In [29]: regression.predict([[0.62976,0.0,8.14,0.0,0.538,5.949,61.8,4.7075,4,307,21.0,396.60,8.26]])
Out[29]: array([19.58009845])
In [30]: regression.intercept_
Out[30]: 35.0401660294875
In [31]: regression.coef_
Let's plot a bar chart of the above coefficients using the matplotlib plotting library.
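The cell that builds lr_coefficient and the axes is not shown in this print; a sketch of how they were likely constructed (names taken from the plotting code below):
In [ ]: lr_coefficient = pd.DataFrame()
lr_coefficient["columns"] = x.columns
lr_coefficient['Coefficient Estimate'] = pd.Series(regression.coef_)
fig, ax = plt.subplots(figsize=(20, 10))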
ax.bar(lr_coefficient["columns"],
lr_coefficient['Coefficient Estimate'])
ax.spines['bottom'].set_position('zero')
plt.style.use('ggplot')
plt.grid()
plt.show()
color = ['tab:gray', 'tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple', 'tab:brown', 'tab:pink', 'tab:olive', 'tab:cyan']   # last two entries assumed; the list is truncated in the print
ax.bar(lr_coefficient["columns"],
lr_coefficient['Coefficient Estimate'],color = color)
ax.spines['bottom'].set_position('zero')
plt.style.use('ggplot')
plt.show()
Data Analytics II
1. Implement logistic regression using Python to perform classification on Social_Network_Ads.csv dataset.
2. Compute the confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall on the given dataset.
Our dataset contains some information about all of our users in the social network, including their User ID, Gender, Age, and
Estimated Salary. The last column of the dataset is a vector of booleans describing whether or not each individual ended up
clicking on the advertisement (0 = False, 1 = True).
In [88]: data = pd.read_csv('Social_Network_Ads.csv')   # loading cell implied by the calls below
In [89]: data
In [90]: data.head(5)
In [91]: data.tail()
In [92]: data.shape
Out[92]: (400, 5)
In [93]: data.columns
In [94]: data.describe()
In [95]: data.isnull().sum()
Out[95]: User ID 0
Gender 0
Age 0
EstimatedSalary 0
Purchased 0
dtype: int64
In [96]: # We require two columns, Age and EstimatedSalary, from the given data frame
data.iloc[:, 2:4]
      Age  EstimatedSalary
0      19            19000
1      35            20000
2      26            43000
3      27            57000
4      19            76000
..    ...              ...
395    46            41000
396    51            23000
397    50            20000
398    36            33000
399    49            36000
In [102]: sns.pairplot(data, hue='Purchased', palette='bwr')
In [104]: X = data[['Age', 'EstimatedSalary']]
y = data['Purchased']
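The train/test split cell is not shown; a sketch consistent with the 300/100 shapes printed below (test_size=0.25 and random_state=0 are assumptions):
In [105]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)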
In [106]: X_train.shape
Out[106]: (300, 2)
In [107]: X_test.shape
Out[107]: (100, 2)
In [108]: y_train.shape
Out[108]: (300,)
In [109]: y_test.shape
Out[109]: (100,)
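The model cells are not shown; a minimal sketch matching the Out[111] banner and the predictions variable used below:
In [111]: from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)   # fit() echoes the estimator, hence the Out[111] below
# a following cell (not shown) would compute: predictions = classifier.predict(X_test)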
Out[111]: LogisticRegression()
In [114]: from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))
cm = confusion_matrix(y_test, predictions)
print(cm)
[[68 0]
[32 0]]
This confusion matrix tells us that there were 68 correct predictions and 32 incorrect ones, meaning the model overall
achieved a 68% accuracy rating. Note that it predicted the negative class for every test sample, so all 32 errors are false negatives.
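To compute the metrics the assignment asks for directly from this matrix, a sketch (sklearn's binary layout is [[TN, FP], [FN, TP]]):
In [ ]: TN, FP, FN, TP = cm.ravel()
accuracy = (TP + TN) / (TP + TN + FP + FN)          # 68/100 = 0.68
error_rate = 1 - accuracy                           # 0.32
precision = TP / (TP + FP) if (TP + FP) else 0.0    # undefined here: the model made no positive predictions
recall = TP / (TP + FN)                             # 0/32 = 0.0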
Data Analytics III
Que: Implement simple naive bayes classification algorithm using python on iris.csv dataset and compare confusion matrix to
find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset
Probability: Probability is a number that reflects the chance or likelihood that a particular event will occur.
Conditional Probability: Conditional probability is the probability of an event happening, given that another event has already happened. For example, the probability of it raining tomorrow given that it is cloudy today. Think of it as "if-then" probability: if a certain condition is met (it is cloudy), then the likelihood of another event occurring (it raining) can change. It allows us to see how one event influences the probability of another event.
Naive Bayes Classifier: The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
The Naïve Bayes algorithm is composed of two words, Naïve and Bayes, which can be described as follows.
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem: Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
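In symbols, Bayes' theorem states that for a hypothesis H and observed evidence E:
P(H|E) = P(E|H) * P(H) / P(E)
where P(H) is the prior probability of the hypothesis, P(E|H) is the likelihood of the evidence given the hypothesis, and P(H|E) is the posterior probability.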
In [2]: df = pd.read_csv('iris.csv')
In [3]: df
In [4]: df.shape
Out[4]: (150, 6)
In [5]: df.describe()
In [6]: df.isnull()
In [7]: df.isnull().sum()
Out[7]: Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
In [8]: x = df.drop(["Species"],axis=1)
y = df["Species"]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(120, 5)
(120,)
(30, 5)
(30,)
Train the Naive Bayes Classifier Model
In [11]: from sklearn.naive_bayes import MultinomialNB
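The fit, score, and predict cells are not shown; a minimal sketch consistent with the outputs below (variable names assumed):
In [12]: classifier = MultinomialNB()
classifier.fit(x_train, y_train)            # echoes the estimator, hence Out[12]
In [13]: classifier.score(x_test, y_test)   # mean accuracy on the test split, hence Out[13]
In [14]: y_pred = classifier.predict(x_test)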
Out[12]: MultinomialNB()
Out[13]: 0.8333333333333334
In [15]: y_pred
In [16]: y_test
Confusion Matrix
In [17]: import sklearn.metrics
lbs = ['Iris-versicolor','Iris-setosa','Iris-virginica']
print(sklearn.metrics.confusion_matrix(y_test, y_pred, labels = lbs))
[[10 0 3]
[ 1 10 0]
[ 1 0 5]]
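The assignment asks for TP, FP, TN, and FN; with three classes these are computed per class (one-vs-rest) from the matrix above. A sketch:
In [ ]: import numpy as np
cm = sklearn.metrics.confusion_matrix(y_test, y_pred, labels=lbs)
TP = np.diag(cm)                # true positives, per class
FP = cm.sum(axis=0) - TP        # false positives, per class
FN = cm.sum(axis=1) - TP        # false negatives, per class
TN = cm.sum() - (TP + FP + FN)  # true negatives, per class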
In [18]: from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, labels=lbs))
accuracy                            0.83        30
macro avg        0.82      0.84     0.82        30
weighted avg     0.85      0.83     0.84        30
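The notebook then repeats the experiment with GaussianNB; those cells are not shown, so here is a minimal sketch consistent with the outputs below:
In [21]: from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train, y_train)            # echoes the estimator, hence Out[21]
In [22]: gnb.score(x_test, y_test)   # hence Out[22]
In [23]: y_pred = gnb.predict(x_test)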
Out[21]: GaussianNB()
Out[22]: 0.8333333333333334
In [24]: y_pred
In [25]: y_test
#import nltk
#nltk.download()
Tokenization of words:
We use the method word_tokenize() to split a sentence into words.
Word tokenization is a crucial step in converting text (strings) into numeric data.
In [2]: nltk.download('punkt')   # tokenizer models used by word_tokenize; download cell implied by the Out below
Out[2]: True
Word Tokenizer
In [3]: import nltk
from nltk.tokenize import word_tokenize
text = "Welcome to the Python Programming at Indeed Insprining Infotech"
print(word_tokenize(text))
Sentence Tokenizer
In [4]: from nltk.tokenize import sent_tokenize
text = "Hello Everyone. Welcome to the Python Programming"
print(sent_tokenize(text))
Stemming
When we have many variations of the same word (for example, the word "dance" has the variations "dancing",
"dances", and "danced"), a stemming algorithm works by cutting the suffix from the word.
For example, stemming the variations "cleaning", "cleans", and "cleaned" reduces each of them to the root "clean".
Lemmatization
Why is Lemmatization better than Stemming?
A stemming algorithm works by cutting the suffix from the word, while lemmatization is a more powerful operation
because it performs a morphological analysis of the words.
Stemming Code:
In [6]: import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
text = "studies studying floors cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
print('Stemming for ', w,'is',porter_stemmer.stem(w))
Stemming for studies is studi
Stemming for studying is studi
Stemming for floors is floor
Stemming for cry is cri
Lemmatization Code:
In [7]: import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
Out[7]: True
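The lemmatization loop itself is not shown in this print; a sketch mirroring the stemming cell above:
In [8]: from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying floors cry"
for w in nltk.word_tokenize(text):
    print('Lemma for', w, 'is', wordnet_lemmatizer.lemmatize(w))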
In [9]: nltk.download('stopwords')
Out[9]: True
In [10]: from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
data = 'AI was introduced in the year 1956 but it gained popularity recently.'
stopwords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
for w in words:
    if w not in stopwords:
        wordsFiltered.append(w)
print(wordsFiltered)
print(len(stopwords))
print(stopwords)
179
{'but', 'only', 'couldn', "don't", 'who', 'each', 'yours', 'with', 'had', 'they', 'don', 'does', 'herself', 'i
f', 'doesn', 'my', 'are', 'an', 'shouldn', 'she', 'doing', 'hasn', 'ourselves', 's', 'i', 'itself', "won't",
'o', "couldn't", 'shan', 'no', 'just', 'more', 'above', 'ain', 'ma', 'through', 'him', 'once', 'didn', 'wasn',
'few', "you're", "wasn't", 'both', 'not', 'having', 'over', 'to', 'hers', 'or', 'on', 'because', 'off', "is
n't", "shan't", 'is', 'below', 'same', "you'd", 'you', 'the', 'y', 'me', "haven't", 'those', 'that', 'wouldn',
'of', 'own', 'we', 'then', 'will', 'them', 'too', 'mustn', 'until', 'some', 'nor', "you've", "doesn't", "it's",
't', "mustn't", 'aren', 've', 'did', 'a', "should've", 'ours', 'd', 'himself', 'all', 'before', 'haven', 'out',
'between', 'now', 'been', 'into', 'needn', 'what', 'was', 'be', 'am', 'as', "aren't", 'down', 'after', 'their',
'during', 'so', 'it', 'which', 'here', 'other', 'than', 'm', "mightn't", 'won', 'for', "that'll", 'from', 'bein
g', 'against', "you'll", 'themselves', "needn't", 'its', 'very', 'he', 'in', 'her', 'most', 'can', "weren't",
"shouldn't", 'again', 'while', "she's", 'and', "didn't", 'further', 'such', "hasn't", 'at', 'these', 'where',
'weren', 'up', 'do', 'll', 'isn', 'should', "wouldn't", 'myself', 'under', 'any', 'how', 'your', 'mightn', 'hav
e', 'theirs', 'whom', 'when', 'yourselves', 'this', 'has', "hadn't", 'were', 'there', 'about', 'why', 'his', 'o
ur', 're', 'yourself', 'by', 'hadn'}
Text Analytics
1. Extract Sample document and apply following document preprocessing methods: Tokenization, POS Tagging, stop words
removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse Document Frequency.
In [13]: document = "This is an example document that we will use to demonstrate document preprocessing."
In [14]: # Tokenization
tokens = word_tokenize(document)
In [15]: tokens
Out[15]: ['This',
'is',
'an',
'example',
'document',
'that',
'we',
'will',
'use',
'to',
'demonstrate',
'document',
'preprocessing',
'.']
Out[16]: True
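The POS-tagging and stop-word-removal cells are not shown in this print; a sketch of those steps producing the filtered_tokens used below (NLTK's English stop-word list assumed):
In [17]: import nltk
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(tokens)   # POS tagging, as the assignment asks
In [18]: from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]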
In [20]: # Stemming
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
stemmed_tokens = [ps.stem(word) for word in filtered_tokens]
In [21]: # Lemmatization
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
lemmatized_tokens = [wnl.lemmatize(word) for word in filtered_tokens]
Text Analytics
Create representation of document by calculating Term Frequency and Inverse Document Frequency.
This code uses the Counter class from the collections module to count the term frequency for each document. It then
calculates the inverse document frequency by iterating over the tokenized documents and keeping track of the number of
documents that each term appears in. Finally, it multiplies the term frequency of each term in each document by its
corresponding inverse document frequency to get the TF-IDF weight for each term in each document. The resulting TF-IDF
representation for each document is printed to the console.
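The code the paragraph describes is not shown in this print; a minimal sketch under the same approach (the two tokenized documents are hypothetical):
In [ ]: import math
from collections import Counter
documents = [['this', 'is', 'a', 'sample'], ['this', 'is', 'another', 'example']]   # hypothetical tokens
N = len(documents)
tf = [Counter(doc) for doc in documents]                            # term frequency per document
df_count = Counter(term for doc in documents for term in set(doc))  # documents containing each term
idf = {term: math.log(N / count) for term, count in df_count.items()}
for i, doc_tf in enumerate(tf):
    tfidf = {term: freq * idf[term] for term, freq in doc_tf.items()}
    print('Document', i, ':', tfidf)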
Data Visualization I
i)Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information about the passengers who boarded
the unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data.
ii)Write a code to check how the price of the ticket (column name: 'fare') for each passenger is distributed by plotting a
histogram.
iii)Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for distribution of age with respect to each
gender along with the information about whether they survived or not. (column names: 'sex' and 'age')
Alternatively, if you are using the Anaconda distribution of Python, you can execute the following command to
install the seaborn library:
conda install seaborn
In [2]: import seaborn as sns
dataset = sns.load_dataset('titanic')   # loading cell implied by the output below
dataset.head()
Out[2]: survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
The dataset contains 891 rows and 15 columns and contains information about the passengers who boarded the unfortunate
Titanic ship. The original task is to predict whether or not the passenger survived depending upon different features such as
their age, ticket, cabin they boarded, the class of the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.
In [3]: dataset.shape
In [5]: dataset.isnull()
Out[5]:     survived  pclass    sex    age  sibsp  parch   fare  embarked  class    who  adult_male   deck  embark_town  alive  alone
1      False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
3      False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
6      False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
10     False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
11     False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
...      ...     ...    ...    ...    ...    ...    ...       ...    ...    ...         ...    ...          ...    ...    ...
871    False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
872    False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
879    False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
887    False   False  False  False  False  False  False     False  False  False       False  False        False  False  False
In [6]: dataset.isnull().sum()
Out[6]: survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
embarked 0
class 0
who 0
adult_male 0
deck 0
embark_town 0
alive 0
alone 0
dtype: int64
Distributional Plots
Distributional plots, as the name suggests, are the type of plots that show the statistical distribution of the data.
In [8]: # Let's see how the price of the ticket for each passenger is distributed.
sns.distplot(dataset['fare'])
You can see that most of the tickets were sold for between 0 and 50 dollars. The line that you see represents the kernel
density estimation. You can remove this line by passing False as the value for the kde parameter, as shown below:
In [9]: sns.distplot(dataset['fare'], kde=False)
Now you can see there is no line for the kernel density estimation on the plot.
You can also pass a value for the bins parameter in order to see more or less detail in the graph:
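The cell itself is not shown in this print; a minimal sketch with 10 bins, matching the description below:
In [10]: sns.distplot(dataset['fare'], kde=False, bins=10)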
<Axes: xlabel='fare'>
In the output, you will see the data distributed in 10 bins. You can clearly see that for more than 700 passengers, the ticket price is
between 0 and 50.
Data Visualization III
Download the Iris flower dataset or any other dataset into a DataFrame. Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a boxplot for each feature in the dataset.
4. Compare distributions and identify outliers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Iris = pd.read_csv('iris.csv')   # loading cell implied by the calls below
In [3]: Iris.shape
Out[3]: (150, 6)
In [4]: Iris.describe()
In [5]: Iris.dtypes
Out[5]: Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object
In [ ]: Iris.isnull().sum()
Out[ ]: Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
In [ ]: print(Iris.groupby('Species').size())
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
Data Visualization
In [ ]: # Box and whisker plots
Iris.plot(kind='box', subplots=True, layout=(3,2), figsize=(8,12));
In [ ]: Iris.hist(figsize=(12,12))
plt.show()
In [ ]: sns.heatmap(Iris.corr())   # the Axes output below implies the correlation matrix was plotted
Out[ ]: <AxesSubplot:>
In [ ]: sns.pairplot(Iris)
In [ ]: from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
Step 2: List down the features and their types
The Iris flower dataset contains the following features:
sepal length (cm): numeric
sepal width (cm): numeric
petal length (cm): numeric
petal width (cm): numeric
target (species): nominal
Step 3: Create a histogram for each feature
In [ ]: iris_df.hist()
plt.show()
Step 4: Create a boxplot for each feature
In [ ]: iris_df.boxplot()
plt.show()
Step 5: Compare distribution and identify outliers
By looking at the histograms and boxplots, we can see the distribution of each feature and identify any outliers.
For example, we can see that the petal length and petal width have a bimodal distribution. The sepal length and sepal width
have a more normal distribution. We can also see that there are some outliers in the sepal width and petal length features.
We can further investigate the outliers using the describe method of the DataFrame as follows:
In [ ]: iris_df.describe()
Out[ ]:        sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
In [ ]: Q1 = Iris['SepalWidthCm'].quantile(0.25)
Q3 = Iris['SepalWidthCm'].quantile(0.75)
IQR = Q3 - Q1
data = Iris[(Iris.SepalWidthCm < (Q1 - 1.5 * IQR)) | (Iris.SepalWidthCm > (Q3 + 1.5 * IQR))]
In [ ]: data
In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]: dataset = sns.load_dataset('titanic')
dataset.head()
Out[ ]: survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
In [ ]: # Let's plot a joint plot of age and fare columns to see if we can find any relationship between the two.
sns.jointplot(x='age', y='fare', data=dataset)
From the output, you can see that a joint plot has three parts: a distribution plot at the top for the column on the x-axis, a
distribution plot on the right for the column on the y-axis, and a scatter plot in between that shows the mutual distribution of
data for both columns. You can see that there is no clear correlation between age and fare.
You can change the type of the joint plot by passing a value for the kind parameter. For instance, if instead of scatter plot, you
want to display the distribution of data in the form of a hexagonal plot, you can pass the value hex for the kind parameter.
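The hexagonal joint plot cell is not shown in this print; a minimal sketch:
In [ ]: sns.jointplot(x='age', y='fare', data=dataset, kind='hex')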
In the hexagonal plot, the hexagon with most number of points gets darker color. So if you look at the above plot, you can
see that most of the passengers are between age 20 and 30 and most of them paid between 10-50 for the tickets.
The Rug Plot
The rugplot() is used to draw small bars along x-axis for each point in the dataset.
In [ ]: # To plot a rug plot, you need to pass the name of the column. Let's plot a rug plot for fare.
sns.rugplot(dataset['fare'])
Out[ ]: <AxesSubplot:xlabel='fare'>
You can see that, as was the case with distplot(), most of the fare values lie between 0 and 100.
These are some of the most commonly used distribution plots offered by Python's Seaborn library.
Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. Categorical plots plot the values of a
categorical column against another categorical column or a numeric column. Let's see some of the most commonly used
categorical plots.
In [ ]: # To know the mean value of the age of the male and female passengers, you can use the bar plot
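The call itself is not shown in this print; a minimal sketch (seaborn's barplot aggregates with the mean by default):
In [ ]: sns.barplot(x='sex', y='age', data=dataset)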
You can clearly see that the average age of male passengers is just under 40, while the average age of female passengers is
around 33.
In addition to finding the average, the bar plot can also be used to calculate other aggregate values for each category. To do so, you
need to pass the aggregate function to the estimator parameter.
In [ ]: # To calculate the standard deviation for the age of each gender,
import numpy as np
# we use the std aggregate function from the numpy library as the estimator.
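A minimal sketch of the call the comments above describe:
In [ ]: sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)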
In [ ]: # To count the number of male and female passengers, we can use a count plot
sns.countplot(x='sex', data=dataset)
In [ ]: # Let's plot a box plot that displays the distribution of age with respect to each gender. You need to pass 'sex' as x and 'age' as y.
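The call itself is not shown in this print; a minimal sketch:
In [ ]: sns.boxplot(x='sex', y='age', data=dataset)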
Let's try to understand the box plot for females. The first quartile starts at around 5 and ends at 22, which means that 25% of
the passengers are aged between 5 and 22. The second quartile starts at around 23 and ends at around 32, which means that
25% of the passengers are aged between 23 and 32. Similarly, the third quartile starts and ends between 34 and 42, hence
25% of the passengers are aged within this range, and finally the fourth or last quartile starts at 43 and ends around 65.
Passengers whose ages do not fall within any of the quartile ranges are called outliers and are
represented by dots on the box plot.
In [ ]: # If you want to see the box plots of age for passengers of both genders, along with information about whether or not they survived, pass 'survived' as the hue parameter, as sketched below.
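A minimal sketch of that call:
In [ ]: sns.boxplot(x='sex', y='age', data=dataset, hue='survived')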
In [ ]: # Let's plot a violin plot that displays the distribution for the age with respect to each gender.
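The call itself is not shown; a minimal sketch:
In [ ]: sns.violinplot(x='sex', y='age', data=dataset)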
Violin plots provide much more information about the data as compared to the box plot. Instead of plotting the quartiles, the
violin plot allows us to see all the components that actually correspond to the data. The area where the violin plot is thicker
has a higher number of instances for the age.
In [ ]: # add another categorical variable to the violin plot using the hue parameter
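A minimal sketch of that call:
In [ ]: sns.violinplot(x='sex', y='age', data=dataset, hue='survived')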
Now you can see a lot of information on the violin plot. For instance, if you look at the bottom of the violin plot for the males
who survived (left-orange), you can see that it is thicker than the bottom of the violin plot for the males who didn't survive
(left-blue). This means that the number of young male passengers who survived is greater than the number of young male
passengers who did not survive. The violin plots convey a lot of information, however, on the downside, it takes a bit of time
and effort to understand the violin plots.
The Strip Plot
The strip plot draws a scatter plot where one of the variables is categorical. We have seen scatter plots in the joint plot and
the pair plot sections, where we had two numeric variables. The strip plot is different in that one of the variables is
categorical, and for each category in the categorical variable you will see a scatter plot with respect to the numeric
column.
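The plotting cell is not shown in this print; a minimal sketch:
In [ ]: sns.stripplot(x='sex', y='age', data=dataset)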
Here you can see the scattered plots of age for both males and females. The data points look like strips; it is difficult to comprehend the
distribution of data in this form.
In [ ]: # Like violin and box plots, you can add an additional categorical column to the strip plot using the hue parameter, as sketched below.
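A minimal sketch of that call (jitter spreads the points so they do not overlap):
In [ ]: sns.stripplot(x='sex', y='age', data=dataset, hue='survived', jitter=True)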
Seaborn is an advanced data visualization library built on top of the Matplotlib library. In the code above, we looked
at how we can draw distributional and categorical plots using the Seaborn library.