AI Tools and Applications Lab
import numpy as np
# Enter the value of n
n=int(input('Enter no. of values:'))
# Generates n random numbers from Normal Distribution
rand_num = np.random.normal(0,1,n)
print(n, " random numbers from a standard normal distribution:")
print(rand_num)
arr=np.array([rand_num])
# Displays the size of the Array
print(arr.shape)
output
Enter no. of values:10
10 random numbers from a standard normal distribution:
[-0.34953998  1.60514591 -0.60005696  0.26263808  0.87930153  0.98339437
  0.40472381 -0.73362668 -0.20067116 -0.97191095]
(1, 10)
# minimum element along each column (axis 0) and each row (axis 1)
min_element_column = np.amin(arr, 0)
min_element_row = np.amin(arr, 1)
print(min_element_column, min_element_row)
output
[ 0 1 2 3 4 5 6 7 8 9 10]
[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
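The three arrays above appear without the code that created them; a minimal sketch, assuming np.arange / np.linspace were used (the exact calls are an assumption):
print(np.arange(0, 11))        # integers 0..10
print(np.linspace(0, 1, 11))   # 0.0 to 1.0 in steps of 0.1
print(np.arange(1, 2, 0.1))    # 1.0 to 1.9 in steps of 0.1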
output
0 ZERO
1 ONE
2 TWO
3 THREE
4 FOUR
5 FIVE
6 SIX
7 SEVEN
8 EIGHT
9 NINE
10 TEN
dtype: object
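The series above, mapping digits to their names, has no accompanying code; a minimal sketch that reproduces it (the exact construction is an assumption):
import pandas as pd
words = ['ZERO', 'ONE', 'TWO', 'THREE', 'FOUR', 'FIVE',
         'SIX', 'SEVEN', 'EIGHT', 'NINE', 'TEN']
s = pd.Series(words)   # default integer index 0..10, dtype: object
print(s)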
f. Create pandas series with data and index and display the index values.
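The code for this exercise is not shown; a minimal sketch consistent with the output below (the data values and index labels are taken from that output):
import pandas as pd
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(s)               # series with a labelled index
print(s.index.values)  # display the index values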
output
a 10
b 20
c 30
d 40
e 50
dtype: int64
import pandas as pd
import numpy as np
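# NOTE: exam_data and labels are not defined in the original listing; a definition
# consistent with the DataFrame printed below would be (an assumed reconstruction):
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily'],
             'score': [12.5, 9.0, 16.5, np.nan, 9.0],
             'attempts': [1, 3, 2, 3, 2],
             'qualify': ['yes', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e']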
df = pd.DataFrame(exam_data , index=labels)
print("Dataset is as follows")
print(df)
print("Summary of the Dataset")
print(df.info())
print("Statistical values of numerical attributes")
print(df.describe())
meanvalue=df.score.mean()
stdvalue=df.score.std()
print('mean value of Score is',meanvalue)
print('Standard deviation of score is',stdvalue)
output
Dataset is as follows
name score attempts qualify
a Anastasia 12.5 1 yes
b Dima 9.0 3 no
c Katherine 16.5 2 yes
d James NaN 3 no
e Emily 9.0 2 no
Summary of the Dataset
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a to e
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5 non-null object
1 score 4 non-null float64
2 attempts 5 non-null int64
3 qualify 5 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 200.0+ bytes
None
Statistical values of numerical attributes
score attempts
count 4.000000 5.00000
mean 11.750000 2.20000
std 3.570714 0.83666
min 9.000000 1.00000
25% 9.000000 2.00000
50% 10.750000 2.00000
75% 13.500000 3.00000
max 16.500000 3.00000
mean value of Score is 11.75
Standard deviation of score is 3.570714214271425
output
[5 rows x 12 columns]
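The code that produced the listing below is not shown; it matches the output of data.dtypes, so a minimal sketch (assuming the Titanic DataFrame is named data):
print(data.dtypes)   # data type of each column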
output
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
b. Find the number of non-null values in each column.
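The code for this step is missing; the output below matches data.info(), so a minimal sketch (data.count() would report the non-null counts alone):
data.info()   # non-null count and dtype for every column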
output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
c. Find out the unique values in each categorical column and the frequency of each unique value.
# Categorical columns are those whose datatype is not numerical (i.e. not int, float, etc.).
# To get all categorical columns we can use DataFrame.select_dtypes and specify the
# required datatype; in our case that is the "object" datatype.
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns are : ", categorical_cols)
print("printing the results")
for i in categorical_cols:
    print("========== Column '" + i + "' =============")
    print(data[i].value_counts())
output
d. Find the number of rows where age is greater than the mean age of data.
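No code is shown for this step; a minimal sketch, assuming the age column is named Age as in the Titanic data above:
mean_age = data['Age'].mean()
print((data['Age'] > mean_age).sum())   # number of rows with age above the mean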
output
# reading the data
data = pd.read_csv("https://raw.githubusercontent.com/naveenjoshii/Intro-to-MachineLearning/master/Titanic/titanic.csv")
print(data.head())
output
[5 rows x 12 columns]
output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
output
# plotting a countplot of survived vs. not survived for each gender
import seaborn as sns
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=data, palette='colorblind')
output
<matplotlib.axes._subplots.AxesSubplot at 0x7f621a047810>
d. Number of survivals in each passenger class
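The plotting code for this step is not included; a minimal sketch in the same style as the countplot above (the column names are assumed from the Titanic data):
sns.countplot(x='Pclass', hue='Survived', data=data, palette='colorblind')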
output
<matplotlib.axes._subplots.AxesSubplot at 0x7f621a034510>
output
<matplotlib.axes._subplots.AxesSubplot at 0x7f6219b36390>
4. Perform Data Analysis on the California House Price data to answer the following
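The code that loads the dataset is not shown; a minimal sketch, assuming a local CSV file with the standard California Housing columns (the file name is hypothetical):
import pandas as pd
data = pd.read_csv('housing.csv')   # hypothetical file name
print(data.head())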
output
[5 rows x 10 columns]
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
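The value printed below matches the mean of housing_median_age; the code is missing, so a minimal sketch of the assumed call:
print(data['housing_median_age'].mean())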
Output
28.639486434108527
c. Determine the top 10 localities with the highest difference between income and house value,
and the top 10 localities with the lowest difference.
# calculating the difference between house value and income, and adding a new column
# 'diff_income_and_house_value' with the difference values
data['diff_income_and_house_value'] = data['median_house_value'] - data['median_income']
# sorting the whole dataframe by the difference value in descending order
data.sort_values(by='diff_income_and_house_value', ascending=False,inplace=True)
#printing the top 10 localities with highest difference
print("the top 10 localities with highest difference")
print(data['ocean_proximity'].head(10))
#printing the top 10 localities with lowest difference
print("the top 10 localities with lowest difference")
print(data['ocean_proximity'].tail(10))
Output
# total number of rooms
total_rooms = data['total_rooms'].sum()
# total number of bedrooms
total_bedrooms = data['total_bedrooms'].sum()
# printing the (integer) ratio of total rooms to total bedrooms
print(total_rooms // total_bedrooms)
Output
4.0
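The table below aggregates median_house_value by ocean_proximity; the code is not shown, so a minimal sketch (that the aggregate is the median rather than another statistic is an assumption):
print(data.groupby('ocean_proximity')['median_house_value'].median())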
Output
ocean_proximity
<1H OCEAN 214850.0
INLAND 108500.0
ISLAND 414700.0
NEAR BAY 233800.0
NEAR OCEAN 229450.0
Name: median_house_value, dtype: float64
a. Determine the outliers in each non-categorical column of Titanic Data and remove them.
[5 rows x 12 columns]
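The helper detect_outliers is used below but never defined; one plausible definition, assuming its second argument is a standard-deviation multiplier (an IQR-based rule would work equally well):
def detect_outliers(series, k):
    # hypothetical helper: bounds k standard deviations away from the mean
    mean, std = series.mean(), series.std()
    return mean - k * std, mean + k * std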
df = data.copy()
lb,ub = detect_outliers(data["Fare"],4)
# removing the rows which are greater than upperbound
df.drop(df[df.Fare > ub].index, inplace=True)
# removing the rows which are less than lowerbound
df.drop(df[df.Fare < lb ].index, inplace=True)
lb,ub = detect_outliers(data["Age"],5)
# removing the rows which are greater than upperbound
df.drop(df[df.Age > ub].index, inplace=True)
# removing the rows which are less than lowerbound
df.drop(df[df.Age < lb].index, inplace=True)
b. Determine missing values in each column of Titanic data. If missing values account for
30% of data, then remove the column.
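The code for this step is missing; a minimal sketch that prints the percentage of missing values per column, matching the output below (df is the outlier-filtered copy from above):
print(df.isnull().sum() / len(df) * 100)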
Output
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 20.113636
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 77.954545
Embarked 0.227273
dtype: float64
Output
# As we can see, the Cabin column has more than 30% missing values, so we drop that column
df.drop(['Cabin'], inplace=True, axis=1)
# after removing the Cabin column, print the columns again; note that Cabin is no longer present
df.columns
Output
c. If missing values are less than 30% of entire data then create a new data frame
i. Missing values in numeric columns are filled with the mean of the corresponding
column.
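No code is shown for this sub-step; a minimal sketch for the Age column, assuming the two values below are the missing percentage before and after filling:
print(df['Age'].isnull().mean() * 100)           # missing percentage before filling
df['Age'].fillna(df['Age'].mean(), inplace=True)
print(df['Age'].isnull().mean() * 100)           # missing percentage after filling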
Output
20.113636363636363
Output
0.0
ii. Missing values in categorical columns are filled with the most frequently occurring
value.
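No code is shown for this sub-step either; a minimal sketch for the Embarked column, again assuming the two values below are the missing percentage before and after filling:
print(df['Embarked'].isnull().mean() * 100)      # missing percentage before filling
df['Embarked'].fillna(df['Embarked'].mode()[0], inplace=True)
print(df['Embarked'].isnull().mean() * 100)      # missing percentage after filling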
Output
0.22727272727272727
Output
0.0
a. Determine the categorical columns in Titanic Dataset. Convert Columns with string data
type to numerical data using encoding techniques.
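The encoding code is not shown; the male and Q/S outputs below match pandas dummy variables with the first category dropped, so a minimal sketch (the names sex_df and embark_df are taken from the concat call further down):
sex_df = pd.get_dummies(df['Sex'], drop_first=True)          # keeps only the 'male' column
embark_df = pd.get_dummies(df['Embarked'], drop_first=True)  # keeps the 'Q' and 'S' columns
print(sex_df.head())
print(embark_df.head())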
Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 880 entries, 0 to 890
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 880 non-null int64
1 Survived 880 non-null int64
2 Pclass 880 non-null int64
3 Name 880 non-null object
4 Sex 880 non-null object
5 Age 880 non-null float64
6 SibSp 880 non-null int64
7 Parch 880 non-null int64
8 Ticket 880 non-null object
9 Fare 880 non-null float64
10 Embarked 880 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 122.5+ KB
Output
   male
0     1
1     0
2     0
3     0
4     1
Output
   Q  S
0  0  1
1  0  0
2  0  1
3  0  1
4  0  1
old_data = df.copy()
# we need to drop the Sex and Embarked columns and replace them with the newly created dummy data frames
# as Name and Ticket have no impact on the output label, we can drop them as well
df.drop(['Sex','PassengerId','Embarked','Name','Ticket'], axis=1, inplace=True)
df.head()
Output
Survived Pclass Age SibSp Parch Fare
0 0 3 22.0 1 0 7.2500
1 1 1 38.0 1 0 71.2833
2 1 3 26.0 0 0 7.9250
3 1 1 35.0 1 0 53.1000
4 0 3 35.0 0 0 8.0500
# after dropping the Sex and Embarked columns, we replace them with our new dummy data frames
data = pd.concat([df, sex_df, embark_df], axis=1)
b. Convert data in each numerical column so that it lies in the range [0,1]
Output
   Survived  Pclass   Age  SibSp  Parch     Fare  male  Q  S
1         1       1  38.0      1      0  71.2833     0  0  0
2         1       3  26.0      0      0   7.9250     0  0  1
3         1       1  35.0      1      0  53.1000     0  0  1
4         0       3  35.0      0      0   8.0500     1  0  1
# Scaling the data using MinMaxScaler so that all values lie in the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data[['Age','Pclass','Survived','SibSp','Parch','Fare','male','Q','S']] = scaler.fit_transform(
    data[['Age','Pclass','Survived','SibSp','Parch','Fare','male','Q','S']])
7. Implement the following models on Titanic Dataset and determine the values of
accuracy, precision, recall, f1 score and confusion matrix for the test data.
data.info()
Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 880 entries, 0 to 890
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 880 non-null float64
1 Pclass 880 non-null float64
2 Age 880 non-null float64
3 SibSp 880 non-null float64
4 Parch 880 non-null float64
5 Fare 880 non-null float64
6 male 880 non-null float64
7 Q 880 non-null float64
8 S 880 non-null float64
dtypes: float64(9)
memory usage: 108.8 KB
a. Logistic Regression
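The training code for this model is not shown in the record; a minimal sketch, assuming the features are every column except Survived and a 70/30 split (the split size and random_state are assumptions):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# separate features and label, then split into train and test sets
X = data.drop('Survived', axis=1)
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# fit the model and predict on the held-out test set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predicted = logreg.predict(X_test)
print("predicted result !")
print(predicted)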
Output
predicted result !
array([1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
0., 1., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
0., 1., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0.,
0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0.,
1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 1.,
0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0.,
0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0.,
0., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.,
1., 0., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1.,
1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
1., 1., 0., 1., 0., 0., 1., 0., 0.])
#confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predicted))
Output
[[144 24]
[ 28 68]]
# Precision Score
from sklearn.metrics import precision_score
print("Precision Score",precision_score(y_test,predicted))
Output
# Recall Score
from sklearn.metrics import recall_score
print("recall score",recall_score(y_test,predicted))
Output
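The F1 computation itself is not shown; by analogy with the precision and recall cells above, a minimal sketch:
# F1 Score
from sklearn.metrics import f1_score
print("f1 score", f1_score(y_test, predicted))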
Output
f1 score 0.723404255319149
# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test,predicted))
Output
precision recall f1-score support
Output
# creating a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
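# NOTE: the fit and predict steps are not shown in the original listing; an assumed
# reconstruction using the same train/test split as the logistic regression above
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)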
#confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
Output
[[140 28]
[ 20 76]]
# Precision Score
from sklearn.metrics import precision_score
print("Precision Score",precision_score(y_test,y_pred))
Output
# Recall Score
from sklearn.metrics import recall_score
print("recall score",recall_score(y_test,y_pred))
Output
# F1 Score
from sklearn.metrics import f1_score
print("f1 score",f1_score(y_test,y_pred))
Output
f1 score 0.76
# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
Output
8. Implement the following models on the California House Pricing Dataset and determine
the values of R2 score, the area under roc curve and root mean squared error for the test
set.
a. Linear Regression with Polynomial Features
b. Random Forest Regressor
Output
longitude 0.000000
latitude 0.000000
housing_median_age 0.000000
total_rooms 0.000000
total_bedrooms 1.002907
population 0.000000
households 0.000000
median_income 0.000000
median_house_value 0.000000
ocean_proximity 0.000000
diff_income_and_house_value 0.000000
dtype: float64
# filling null values in total_bedrooms with the most frequent value in that column
data["total_bedrooms"].fillna(data['total_bedrooms'].mode()[0], inplace=True)
Output
0.0
data.info()
Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20640 entries, 4861 to 9188
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20640 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
10 diff_income_and_house_value 20640 non-null float64
dtypes: float64(10), object(1)
memory usage: 1.9+ MB
data['ocean_proximity'].unique()
Output
array(['<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'],
dtype=object)
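The encoding step is not shown; the four-column output below matches pandas dummy variables with the first category (<1H OCEAN) dropped, and the name ocean_prox_df is taken from the concat call further down, so a minimal sketch:
ocean_prox_df = pd.get_dummies(data['ocean_proximity'], drop_first=True)
print(ocean_prox_df.head())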
Output
       INLAND  ISLAND  NEAR BAY  NEAR OCEAN
4861        0       0         0           0
6688        1       0         0           0
16642       0       0         0           1
15661       0       0         1           0
15652       0       0         1           0
old_data = data.copy()
data.drop(['ocean_proximity','longitude','latitude','diff_income_and_house_value'], axis=1, inplace=True)
data.head()
Output
       housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
16642                19.0       1540.0           715.0      1799.0       635.0         0.7025            500001.0
15661                27.0       1728.0           884.0      1211.0       752.0         0.8543            500001.0
15652                52.0       3260.0          1535.0      3260.0      1457.0         0.9000            500001.0
data = pd.concat([data,ocean_prox_df],axis=1)
data.head()
Output
       housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  INLAND  ISLAND  NEAR BAY  NEAR OCEAN
4861                 29.0       2690.0           515.0       229.0       217.0         0.4999            500001.0       0       0         0           0
6688                 28.0        238.0            58.0       142.0        31.0         0.4999            500001.0       1       0         0           0
16642                19.0       1540.0           715.0      1799.0       635.0         0.7025            500001.0       0       0         0           1
15661                27.0       1728.0           884.0      1211.0       752.0         0.8543            500001.0       0       0         1           0
15652                52.0       3260.0          1535.0      3260.0      1457.0         0.9000            500001.0       0       0         1           0
# model initialization
from sklearn.linear_model import LinearRegression
model = LinearRegression()
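Only the model initialisation is shown for exercise 8a; a minimal sketch of the remaining steps, with the polynomial degree, split size, random_state and metric calls as assumptions:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# expand the features with polynomial terms, then split into train and test sets
X = data.drop('median_house_value', axis=1)
y = data['median_house_value']
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.3, random_state=42)

# fit the linear regression on the polynomial features and evaluate on the test set
model.fit(X_train, y_train)
predicted = model.predict(X_test)
print('r2 score is', r2_score(y_test, predicted))
print('root mean squared error is', np.sqrt(mean_squared_error(y_test, predicted)))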
Output
r2 score is 0.590661764648472
Output
expected = y_test
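The code for exercise 8b (Random Forest Regressor) is not included; a minimal sketch, assuming the X_train/X_test split and the expected variable from the surrounding cells (n_estimators and random_state are assumptions):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# fit a random forest regressor and evaluate it on the same test set
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
predicted = rf.predict(X_test)
print('r2 score is', r2_score(expected, predicted))
print('root mean squared error is', np.sqrt(mean_squared_error(expected, predicted)))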
Output
r2 score is 0.7091234171276952
Output
10. Implement a single neural network and test for different logic gates.
# OR gate
import numpy as np

# unit step activation function
def unitStep(v):
    if v >= 0:
        return 1
    else:
        return 0

# perceptron: weighted sum followed by the unit step activation
def perceptronModel(x, w, b):
    v = np.dot(w, x) + b
    y = unitStep(v)
    return y

# OR Logic Function
# w1 = 1, w2 = 1, b = -0.5
def OR_logicFunction(x):
    w = np.array([1, 1])
    b = -0.5
    return perceptronModel(x, w, b)
# testing the Perceptron Model
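# NOTE: the test calls are missing from the listing; the calls below are an assumed
# reconstruction that matches the printed results.
test1 = np.array([0, 1])
test2 = np.array([1, 1])
test3 = np.array([0, 0])
test4 = np.array([1, 0])
print("OR({}, {}) = {}".format(0, 1, OR_logicFunction(test1)))
print("OR({}, {}) = {}".format(1, 1, OR_logicFunction(test2)))
print("OR({}, {}) = {}".format(0, 0, OR_logicFunction(test3)))
print("OR({}, {}) = {}".format(1, 0, OR_logicFunction(test4)))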
Output
OR(0, 1) = 1
OR(1, 1) = 1
OR(0, 0) = 0
OR(1, 0) = 1
# AND gate
import numpy as np

# define Unit Step Function
def unitStep(v):
    if v >= 0:
        return 1
    else:
        return 0
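# NOTE: the rest of the AND-gate code is missing from the listing; an assumed
# reconstruction in the same style as the OR gate above. The weights and bias
# (w1 = 1, w2 = 1, b = -1.5) are chosen so that only the input (1, 1) fires.
def perceptronModel(x, w, b):
    v = np.dot(w, x) + b
    return unitStep(v)

def AND_logicFunction(x):
    w = np.array([1, 1])
    b = -1.5
    return perceptronModel(x, w, b)

# testing the Perceptron Model
print("AND({}, {}) = {}".format(0, 1, AND_logicFunction(np.array([0, 1]))))
print("AND({}, {}) = {}".format(1, 1, AND_logicFunction(np.array([1, 1]))))
print("AND({}, {}) = {}".format(0, 0, AND_logicFunction(np.array([0, 0]))))
print("AND({}, {}) = {}".format(1, 0, AND_logicFunction(np.array([1, 0]))))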
Output
AND(0, 1) = 0
AND(1, 1) = 1
AND(0, 0) = 0
AND(1, 0) = 0
11. Write a program to train and test a Convolutional Neural Network to determine
the number, given an image of a handwritten digit. Determine the training and
validation accuracies of your model. (Train your model for 5 epochs).
from keras.datasets import mnist
# loading the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# printing the shape of the dataset
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
Output
# keras imports for the dataset and building our neural network
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D
from keras.utils import np_utils
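The network definition and training call are missing from the listing; the 79,510 parameters in the summary below are consistent with a fully connected 784 -> 100 -> 10 network, so a minimal sketch along those lines (the exact architecture, batch size and preprocessing are assumptions):
# flatten the 28x28 images to 784-length vectors and scale pixel values to [0, 1]
X_train = X_train.reshape(60000, 784).astype('float32') / 255
X_test = X_test.reshape(10000, 784).astype('float32') / 255

# one-hot encode the digit labels
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

# assumed architecture: Dense(100) + Dense(10) gives 78,500 + 1,010 = 79,510 parameters
model = Sequential()
model.add(Dense(100, input_shape=(784,), activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# a batch size of 128 gives the 469 steps per epoch seen in the log below (60000 / 128 = 469)
model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_data=(X_test, Y_test))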
=================================================================
Total params: 79,510
Trainable params: 79,510
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
469/469 [==============================] - 3s 5ms/step - loss: 0.3805 - accuracy: 0.8950 - val_loss: 0.2060 - val_accuracy: 0.9409
Epoch 2/10
469/469 [==============================] - 2s 5ms/step - loss: 0.1812 - accuracy: 0.9477 - val_loss: 0.1493 - val_accuracy: 0.9566
Epoch 3/10
469/469 [==============================] - 2s 5ms/step - loss: 0.1334 - accuracy: 0.9613 - val_loss: 0.1223 - val_accuracy: 0.9644
Epoch 4/10
469/469 [==============================] - 2s 5ms/step - loss: 0.1055 - accuracy: 0.9699 - val_loss: 0.1059 - val_accuracy: 0.9693
Epoch 5/10
469/469 [==============================] - 2s 5ms/step - loss: 0.0863 - accuracy: 0.9753 - val_loss: 0.1025 - val_accuracy: 0.9697
Epoch 6/10
469/469 [==============================] - 2s 4ms/step - loss: 0.0718 - accuracy: 0.9796 - val_loss: 0.0951 - val_accuracy: 0.9721
Epoch 7/10
469/469 [==============================] - 2s 4ms/step - loss: 0.0615 - accuracy: 0.9822 - val_loss: 0.0865 - val_accuracy: 0.9735
Epoch 8/10
469/469 [==============================] - 2s 5ms/step - loss: 0.0535 - accuracy: 0.9851 - val_loss: 0.0800 - val_accuracy: 0.9761
Epoch 9/10
469/469 [==============================] - 2s 4ms/step - loss: 0.0457 - accuracy: 0.9868 - val_loss: 0.0829 - val_accuracy: 0.9754
Epoch 10/10
469/469 [==============================] - 2s 4ms/step - loss: 0.0391 - accuracy: 0.9888 - val_loss: 0.0784 - val_accuracy: 0.9757
Output
<keras.callbacks.History at 0x7f6bd453df10>
# keras imports for the dataset and building our neural network
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D, Flatten
from keras.utils import np_utils
# to calculate accuracy
from sklearn.metrics import accuracy_score
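The CNN itself is not shown in this second block; a minimal sketch consistent with the imports above (filter counts, dense sizes and batch size are assumptions), trained for the 5 epochs the exercise asks for:
# load and reshape the images to (samples, 28, 28, 1), scaling pixel values to [0, 1]
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

# assumed CNN: one convolution block followed by a small dense head
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=128, epochs=5, validation_data=(X_test, Y_test))

# test accuracy computed with sklearn's accuracy_score on the predicted digit labels
y_pred = model.predict(X_test).argmax(axis=1)
print("test accuracy:", accuracy_score(y_test, y_pred))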
Output
<keras.callbacks.History at 0x7f6bcfde47d0>