AI Final PDF

The document discusses various operations that can be performed on matrices and NumPy arrays like addition, subtraction, multiplication, inversion, transpose etc. It also discusses operations like finding minimum, maximum, mean, trace, rank and eigenvalues of matrices. Finally, it discusses performing similar operations on pandas DataFrames and analyzing the Titanic passenger data.

WEEK:1,2

1. Write a Program to perform the following operations on matrices


a) Matrix addition
b) Matrix Subtraction
c) Matrix Multiplication
d) Matrix Inversion
e) Transpose of a Matrix

a) Matrix addition
import numpy as np
a=np.array([[1,2,3],[3,4,2],[2,4,3]])
b=np.array([[1,2,3],[3,4,2],[2,4,3]])
print("addition of matrix:\n",np.add(a,b))

OUTPUT:
addition of matrix:
[[2 4 6]
[6 8 4]
[4 8 6]]
b) Matrix Subtraction
print("subtraction of matrix:\n",np.subtract(a,b))

OUTPUT:
subtraction of matrix:
[[0 0 0]
[0 0 0]
[0 0 0]]
c) Matrix Multiplication
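# np.multiply returns the element-wise product of a and b;
# the row-by-column matrix product is np.matmul(a,b) (or a @ b)
print("multiplication of matrix:\n",np.multiply(a,b))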

OUTPUT:
multiplication of matrix:
[[ 1 4 9]
[ 9 16 4]
[ 4 16 9]]
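
The output above is the element-wise product, because that is what np.multiply computes. A minimal sketch of the difference between the element-wise product and the row-by-column matrix product, reusing the same a and b (the variable names elementwise and matrix_product are only illustrative):

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 2], [2, 4, 3]])
b = np.array([[1, 2, 3], [3, 4, 2], [2, 4, 3]])

# Element-wise product: entry (i, j) is a[i, j] * b[i, j]
elementwise = np.multiply(a, b)

# Matrix product: entry (i, j) is the dot product of row i of a and column j of b
matrix_product = np.matmul(a, b)  # same result as a @ b

print("element-wise product:\n", elementwise)
print("matrix product:\n", matrix_product)
```

For example, the top-left entry of the matrix product is 1*1 + 2*3 + 3*2 = 13, whereas the element-wise product gives 1*1 = 1.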
d) Matrix Inversion
print("inverse of matrix:\n",np.linalg.inv(a))

OUTPUT:
inverse of matrix:
[[ 0.66666667 1. -1.33333333]
[-0.83333333 -0.5 1.16666667]
[ 0.66666667 0. -0.33333333]]
e) Transpose of a Matrix
print("transpose of matrix:\n",np.transpose(a))

OUTPUT:
transpose of matrix:
[[1 3 2]
[2 4 4]
[3 2 3]]
2. Write a Program to perform the following operations
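a) Find the minimum and maximum element of the matrix
import numpy as np
a=np.array([[1,2,3],[3,4,2],[2,4,3]])
print("maximum value of matrix:\n",np.max(a))
print("minimum value of matrix:\n",np.min(a))

OUTPUT:
maximum value of matrix:
4
minimum value of matrix:
1

b) Find the minimum and maximum element of each row in the matrix
# axis=1 reduces along the columns, giving one value per row
print("minimum value of each row:\n",np.min(a,axis=1))
print("maximum value of each row:\n",np.max(a,axis=1))

OUTPUT:
minimum value of each row:
[1 2 2]
maximum value of each row:
[3 4 4]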

c) Find the minimum and maximum element of each column in the matrix
import numpy as np

A=np.array([[2,2,9],[2,5,8],[1,4,7]])
B=np.array([[9,5,1],[7,5,3],[8,9,6]])

#TO CALCULATE MAXIMUM ELEMENT IN COLUMN WISE OF A MATRIX
print(np.max(A,axis=0))

#TO CALCULATE MINIMUM ELEMENT IN COLUMN WISE OF A MATRIX
print(np.min(A,axis=0))
OUTPUT:
[2 5 9]
[1 2 7]
d) Find trace of the given matrix
import numpy as np

A=np.array([[2,2,9],[2,5,8],[1,4,7]])
B=np.array([[9,5,1],[7,5,3],[8,9,6]])

#TO CALCULATE TRACE OF A MATRIX
print(np.trace(A))


OUTPUT:
14

e) Find rank of the given matrix


import numpy as np

A=np.array([[2,2,9],[2,5,8],[1,4,7]])
B=np.array([[9,5,1],[7,5,3],[8,9,6]])

#TO CALCULATE RANK OF A MATRIX


print(np.linalg.matrix_rank(A))
OUTPUT:
3

f) Find eigenvalues and eigenvectors of the given matrix


import numpy as np

# Define the matrices A and B


A = np.array([[2, 2, 9], [2, 5, 8], [1, 4, 7]])
B = np.array([[9, 5, 1], [7, 5, 3], [8, 9, 6]])

# eigenvalues and eigenvectors for matrix A


eigenvalues_A, eigenvectors_A = np.linalg.eig(A)

print(eigenvalues_A)
OUTPUT:
[13.05054773+0.j 0.47472613+1.17633455j 0.47472613-1.17633455j]
import numpy as np

# Define the matrices A and B


A = np.array([[2, 2, 9], [2, 5, 8], [1, 4, 7]])
B = np.array([[9, 5, 1], [7, 5, 3], [8, 9, 6]])

#eigenvalues and eigenvectors for matrix A


eigenvalues_A, eigenvectors_A = np.linalg.eig(A)

print(eigenvectors_A)

OUTPUT:
[[ 0.54476501+0.j 0.89728078+0.j 0.89728078-0.j ]
[ 0.65530983+0.j -0.05423648-0.36470464j -0.05423648+0.36470464j]
[ 0.52325913+0.j -0.140014 +0.19832352j -0.140014 -0.19832352j]]
WEEK:3
3. Write a program for the following .
a. To generate an array of random numbers from a normal distribution for the
array of a given shape.
import numpy as np
import pandas as pd
# a. to generate an array of random numbers from a normal distribution
random_array =np.random.normal(size=(3,3))
print(random_array)

OUTPUT:
[[-0.72882502 -0.96261731 0.75660861]
[ 0.26895935 0.33844856 -0.12340368]
[ 0.27789698 1.16782929 0.73894252]]

b. Implement Arithmetic operations on two arrays (perform broadcasting also.)


array1=np.random.randint(1,8,size=(2,3))
print(array1)
array2=np.random.randint(1,4,size=(2,1))
print(array2)
# array2 has shape (2,1), so it is broadcast across the 3 columns of array1
ADD = array1 + array2
print(ADD)
MUL = array1*array2
print(MUL)

OUTPUT:
[[2 2 3]
[3 5 5]]
[[3]
[1]]
[[5 5 6]
[4 6 6]]
[[6 6 9]
[3 5 5]]
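
To make the broadcasting step explicit, here is a small sketch that fixes the two arrays to the values printed above (instead of drawing them at random) and shows the stretched form of the (2,1) array that NumPy effectively uses. The name stretched is only illustrative.

```python
import numpy as np

# Fixed copies of the arrays shown in the output above
array1 = np.array([[2, 2, 3], [3, 5, 5]])   # shape (2, 3)
array2 = np.array([[3], [1]])               # shape (2, 1)

# Broadcasting repeats the single column of array2 across the 3 columns of array1
stretched = np.broadcast_to(array2, array1.shape)
print(stretched)

total = array1 + array2      # same values as array1 + stretched
product = array1 * array2    # same values as array1 * stretched
print(total)
print(product)
```

Two shapes are compatible for broadcasting when, compared from the last dimension backwards, each pair of sizes is equal or one of them is 1.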
c. Find minimum, maximum, mean in a given array. ( in both the axes )
# find min max mean of an array
import numpy as np
array1=np.random.randint(1,8,size=(3,3))
print(array1)
MIN=np.min(array1,axis=0)   # minimum of each column
MAX=np.max(array1,axis=1)   # maximum of each row
print(MIN)
print(MAX)

OUTPUT:
[[1 5 2]
[3 1 6]
[6 5 2]]
[1 1 2]
[5 6 6]
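
The program above prints the minimum along axis 0 and the maximum along axis 1 only. A sketch that covers minimum, maximum and mean along both axes, using a fixed copy of the array printed above so the results can be checked by hand (the variable names are illustrative):

```python
import numpy as np

# Fixed copy of the random array shown in the output above
arr = np.array([[1, 5, 2], [3, 1, 6], [6, 5, 2]])

# axis=0 reduces down the rows (one result per column)
col_min, col_max, col_mean = arr.min(axis=0), arr.max(axis=0), arr.mean(axis=0)

# axis=1 reduces across the columns (one result per row)
row_min, row_max, row_mean = arr.min(axis=1), arr.max(axis=1), arr.mean(axis=1)

print("column-wise min/max/mean:", col_min, col_max, col_mean)
print("row-wise min/max/mean:", row_min, row_max, row_mean)
```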

d. Implement np.arange and np.linspace functions.


#implement np.arange and np.linspace functions
arange_array=np.arange(0,10,2)
linspace_array=np.linspace(0,10,2)
print("\n np.arange array:")
print(arange_array)
print("\n np.linspace array:")
print(linspace_array)

OUTPUT:
np.arange array:
[0 2 4 6 8]
np.linspace array:
[ 0. 10.]
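
The third argument means different things in the two functions: in np.arange it is the step, in np.linspace it is the number of points (with the end point included by default). A short sketch, with illustrative variable names:

```python
import numpy as np

# arange: start, stop (excluded), step
step_based = np.arange(0, 10, 2)

# linspace: start, stop (included), number of points
two_points = np.linspace(0, 10, 2)
five_points = np.linspace(0, 10, 5)

print(step_based)
print(two_points)
print(five_points)
```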
e. Create a pandas series from a given list.
list_data=[1,2,6,5]
pandas_series=pd.Series(list_data)
print("\n pandas series")
print(pandas_series)

OUTPUT:
pandas series
0 1
1 2
2 6
3 5
dtype: int64

f. Create pandas series with data and index and display the index values.
import pandas as pd
data_with_index={'a':30,'b':90,'c':46}
pandas_series=pd.Series(data_with_index)
print("\n pandas series")
print(pandas_series)

OUTPUT:

pandas series
a 30
b 90
c 46
dtype: int64
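
The task also asks to display the index values, which the program above does not print. A minimal sketch using the same dictionary; explicit is an illustrative second series built from separate data and index lists:

```python
import pandas as pd

data_with_index = {'a': 30, 'b': 90, 'c': 46}
pandas_series = pd.Series(data_with_index)

# The dictionary keys become the index of the series
index_values = pandas_series.index
print(index_values)
print(list(index_values))

# Equivalent construction with data and index passed separately
explicit = pd.Series([30, 90, 46], index=['a', 'b', 'c'])
print(explicit)
```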
g. Create a data frame with columns at least 5 observations
i. Select a particular column from the DataFrame
ii. Summarize the data frame and observe the stats of the DataFrame created
iii. Observe the mean and standard deviation of the data frame and print the values.
data={'Column1':[1,2,3,4,5],
'Column2':[10,20,30,40,50],
'Column3':[5.5,6.5,7.1,8.9,3.1],
'Column4':['A','B','C','D','E'],
'Column5':[True,False,True,False,True]}
data_frame=pd.DataFrame(data)
print("\ng. DataFrames with 5 columns:")
print(data_frame)
OUTPUT:
g. DataFrames with 5 columns:
   Column1  Column2  Column3 Column4  Column5
0        1       10      5.5       A     True
1        2       20      6.5       B    False
2        3       30      7.1       C     True
3        4       40      8.9       D    False
4        5       50      3.1       E     True
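
Sub-tasks i to iii are not covered by the program above. One possible sketch, rebuilding the same data frame and restricting the statistics to the three numeric columns (the names numeric, summary, means and stds are illustrative):

```python
import pandas as pd

data = {'Column1': [1, 2, 3, 4, 5],
        'Column2': [10, 20, 30, 40, 50],
        'Column3': [5.5, 6.5, 7.1, 8.9, 3.1],
        'Column4': ['A', 'B', 'C', 'D', 'E'],
        'Column5': [True, False, True, False, True]}
data_frame = pd.DataFrame(data)

# i. Select a particular column (returns a Series)
column2 = data_frame['Column2']
print(column2)

# ii. Summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
numeric = data_frame[['Column1', 'Column2', 'Column3']]
summary = numeric.describe()
print(summary)

# iii. Mean and (sample) standard deviation of each numeric column
means = numeric.mean()
stds = numeric.std()
print(means)
print(stds)
```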
WEEK:4
7. Write a Program to determine the following in the Titanic Survival data.

import numpy as np
import pandas as pd
df=pd.read_csv("titanic_data.csv")
print(df.head())

A. Determine the data type of each column.


print("A. Data Types:")
print(df.dtypes)
OUTPUT:
A. Data Types:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
b. Find the number of non-null values in each column.
non_null_counts = df.notnull().sum()
print(non_null_counts)
OUTPUT:
PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64

c. Find out the unique values in each categorical column and frequency of each
unique value.
categorical_columns = df.select_dtypes(include='object').columns
# unique() and value_counts() are methods, so they are called with parentheses
print(df.Sex.unique())
OUTPUT:
['male' 'female']

print(df.Sex.value_counts())
OUTPUT:
Sex
male 577
female 314
Name: count, dtype: int64

print(df.Pclass.unique())
OUTPUT:
[3 1 2]

print(df.Pclass.value_counts())
OUTPUT:
Pclass
3 491
1 216
2 184
Name: count, dtype: int64
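
A sketch of how the unique values and their frequencies could be collected for every categorical column in one loop. The sample rows are hypothetical (not read from titanic_data.csv), and selecting the columns with exclude="number" is an assumption made here so the example does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Hypothetical sample rows for illustration only
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0],
})

# Treat every non-numeric column as categorical
categorical_columns = sample.select_dtypes(exclude="number").columns

uniques = {}
frequencies = {}
for column in categorical_columns:
    uniques[column] = list(sample[column].unique())                # distinct values
    frequencies[column] = sample[column].value_counts().to_dict()  # value -> count
    print(column, "unique values:", uniques[column])
    print(column, "frequencies:", frequencies[column])
```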

d. Find the number of rows where age is greater than the mean age of data.
age_mean = df['Age'].mean()
print(age_mean)
print(np.sum(df['Age']>age_mean))
OUTPUT:
29.69911764705882
330

e. Delete all the rows with missing values.


titanic_data_cleaned = df.dropna()
print(titanic_data_cleaned)
OUTPUT:
PassengerId Survived Pclass ... Fare Cabin Embarked
1 2 1 1 ... 71.2833 C85 C
3 4 1 1 ... 53.1000 C123 S
6 7 0 1 ... 51.8625 E46 S
10 11 1 3 ... 16.7000 G6 S
11 12 1 1 ... 26.5500 C103 S
.. ... ... ... ... ... ... ...
871 872 1 1 ... 52.5542 D35 S
872 873 0 1 ... 5.0000 B51 B53 B55 S
879 880 1 1 ... 83.1583 C50 C
887 888 1 1 ... 30.0000 B42 S
889 890 1 1 ... 30.0000 C148 C

[183 rows x 12 columns]


WEEK:5
8. Perform Data Analysis on the Titanic Data Set to answer the following.

import numpy as np
import pandas as pd
df=pd.read_csv("titanic_data.csv")
print(df.head())

A. Information regarding each column of the data
print(df.info())


OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)

B. Impact of each column on the label


print(df.corr(method='pearson', min_periods=1, numeric_only=True))
OUTPUT:
PassengerId Survived Pclass ... SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 ... -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 ... -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 ... 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 ... -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 ... 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 ... 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 ... 0.159651 0.216225 1.000000

[7 rows x 7 columns]

C. Number of survivals in each gender


print(df.Survived.value_counts())
OUTPUT:
Survived
0 549
1 342
Name: count, dtype: int64

print(df.Sex.value_counts())
OUTPUT:
Sex
male 577
female 314
Name: count, dtype: int64

gender_survival = df.groupby('Sex')['Survived'].sum()
print(gender_survival)
OUTPUT:
Sex
female 233
male 109
Name: Survived, dtype: int64

D. Number of survivals in each passenger class


print(df.Pclass.value_counts())
OUTPUT:
Pclass
3 491
1 216
2 184
Name: count, dtype: int64

print(df.Survived.value_counts())
OUTPUT:
Survived
0 549
1 342
Name: count, dtype: int64

# Group by 'Pclass' and count the number of survivors
class_survival = df.groupby('Pclass')['Survived'].sum()
print(class_survival)
OUTPUT:
Pclass
1 136
2 87
3 119
Name: Survived, dtype: int64

E. The number of people who are not alone.


# Create a new column 'not_Alone' that is True when the passenger has at least
# one sibling, spouse, parent or child aboard
df['not_Alone'] = (df['SibSp'] + df['Parch']) > 0
print(df.SibSp.value_counts())
OUTPUT:
SibSp
0 608
1 209
2 28
4 18
3 16
8 7
5 5
Name: count, dtype: int64

print(df.Parch.value_counts())
OUTPUT:
Parch
0 678
1 118
2 80
5 5
3 5
4 4
6 1
Name: count, dtype: int64

print((df['SibSp'] + df['Parch']).value_counts())
OUTPUT:
0 537
1 161
2 102
3 29
5 22
4 15
6 12
10 7
7 6
Name: count, dtype: int64

# Count the number of people who are not alone
not_alone_count = df['not_Alone'].sum()
print(not_alone_count)
OUTPUT:
354
WEEK:6
9. Perform Data Analysis on the California House Price data to answer the
following

import pandas as pd
df=pd.read_csv("housing.csv")
print(df)
OUTPUT:
longitude latitude ... median_house_value ocean_proximity
0 -122.23 37.88 ... 452600.0 NEAR BAY
1 -122.22 37.86 ... 358500.0 NEAR BAY
2 -122.24 37.85 ... 352100.0 NEAR BAY
3 -122.25 37.85 ... 341300.0 NEAR BAY
4 -122.25 37.85 ... 342200.0 NEAR BAY
... ... ... ... ... ...
20635 -121.09 39.48 ... 78100.0 INLAND
20636 -121.21 39.49 ... 77100.0 INLAND
20637 -121.22 39.43 ... 92300.0 INLAND
20638 -121.32 39.43 ... 84700.0 INLAND
20639 -121.24 39.37 ... 89400.0 INLAND

[20640 rows x 10 columns]

A.Data Type of each column and info regarding each column.


print(df.dtypes)
print(df.info())
OUTPUT:
longitude float64
latitude float64
housing_median_age float64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64
ocean_proximity object
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
b. The average age of a house in the data set.
# mean() must be called with parentheses; without them the bound method is printed instead of the average
average_age_house=df['housing_median_age'].mean()
print(average_age_house)

c. Determine the top 10 localities with the highest difference between income and
house value. Also, the top 10 localities that have the lowest difference.
df['New_column']=df['median_house_value']-df['median_income']
print(df['New_column'])
OUTPUT:
4861 500000.5001
6688 500000.5001
16642 500000.2975
15661 500000.1457
15652 500000.1000
...
5887 17497.6333
19802 14998.4640
2521 14997.3393
2799 14996.9000
9188 14994.8068
Name: New_column, Length: 20640, dtype: float64

df.sort_values(by='New_column',ascending=False,inplace=True)
print(df.head(10))
OUTPUT:
4861 -118.28 34.02 ... <1H OCEAN 500000.5001
6688 -118.08 34.15 ... INLAND 500000.5001
16642 -120.67 35.30 ... NEAR OCEAN 500000.2975
15661 -122.42 37.78 ... NEAR BAY 500000.1457
15652 -122.41 37.80 ... NEAR BAY 500000.1000
6639 -118.15 34.15 ... <1H OCEAN 499999.8333
459 -122.25 37.87 ... NEAR BAY 499999.8304
89 -122.27 37.80 ... NEAR BAY 499999.7566
10448 -117.67 33.47 ... <1H OCEAN 499999.3607
17819 -121.90 37.39 ... <1H OCEAN 499999.2639
[10 rows x 11 columns]

print(df.tail(10))
OUTPUT:
2779 -114.65 32.79 ... INLAND 24999.1429
16186 -121.29 37.95 ... INLAND 22499.2083
14326 -117.16 32.71 ... NEAR OCEAN 22498.9082
1825 -122.32 37.93 ... NEAR BAY 22497.3250
13889 -116.57 35.43 ... INLAND 22497.2862
5887 -118.33 34.15 ... <1H OCEAN 17497.6333
19802 -123.17 40.31 ... INLAND 14998.4640
2521 -122.74 39.71 ... INLAND 14997.3393
2799 -117.02 36.40 ... INLAND 14996.9000
9188 -117.86 34.24 ... INLAND 14994.8068
[10 rows x 11 columns]
d. What is the ratio of bedrooms to total rooms in the data

A=df['total_bedrooms'].sum()
print(A)
OUTPUT:
10990309.0

B=df['total_rooms'].sum()
print(B)
OUTPUT:
54402150.0

ratio=A/B
print(ratio)
OUTPUT:
0.2020197547339581
e. Determine the average price of a house for each type of ocean_proximity.
Avg_house_price=df.groupby('ocean_proximity')['median_house_value'].mean()
print(Avg_house_price)
OUTPUT:
ocean_proximity
<1H OCEAN 240084.285464
INLAND 124805.392001
ISLAND 380440.000000
NEAR BAY 259212.311790
NEAR OCEAN 249433.977427
Name: median_house_value, dtype: float64
WEEK:7
10. Write a program to perform the following tasks
a. Determine the outliers in each non-categorical column of Titanic Data and
remove them.
numerical_columns=df.select_dtypes(exclude='object').columns
print(numerical_columns)
for column in numerical_columns:
    q1=df[column].quantile(0.25)
    q3=df[column].quantile(0.75)
    iqr = q3-q1
    low_b=q1-1.5*iqr
    up_b=q3+1.5*iqr
    # df1 is rebuilt from df on every pass, so after the loop it holds
    # only the filter for the last column (Fare)
    df1=df[(df[column] >= low_b) & (df[column] <= up_b)]
print(df1)

OUTPUT:
Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'],
dtype='object')
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
5 6 0 3 ... 8.4583 NaN Q
.. ... ... ... ... ... ... ...
886 887 0 2 ... 13.0000 NaN S
887 888 1 1 ... 30.0000 B42 S
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q
[775 rows x 12 columns]
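
Because the loop above rebuilds df1 from the unfiltered df on each pass, only the last column's filter survives. A sketch of one way to apply the 1.5*IQR rule to all listed columns at once by combining the per-column conditions into a single mask. The helper remove_outliers_iqr and the sample values are hypothetical and only illustrate the idea.

```python
import pandas as pd

def remove_outliers_iqr(frame, columns):
    """Keep the rows that lie inside the 1.5*IQR fences of every listed column."""
    mask = pd.Series(True, index=frame.index)
    for column in columns:
        q1 = frame[column].quantile(0.25)
        q3 = frame[column].quantile(0.75)
        iqr = q3 - q1
        low_b = q1 - 1.5 * iqr
        up_b = q3 + 1.5 * iqr
        # between() is inclusive; a missing value compares as False,
        # so rows with NaN in this column are dropped as well
        mask &= frame[column].between(low_b, up_b)
    return frame[mask]

# Hypothetical sample: one extreme Age (row 5) and one extreme Fare (row 4)
sample = pd.DataFrame({
    "Age": [22, 25, 27, 30, 31, 95],
    "Fare": [7.0, 8.0, 7.5, 9.0, 300.0, 8.5],
})
cleaned = remove_outliers_iqr(sample, ["Age", "Fare"])
print(cleaned)
```

Note that the quartiles here are computed once on the unfiltered data; recomputing them after each column is filtered would be a different, stricter procedure.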

b. Determine missing values in each column of Titanic data. If missing values
account for 30% of data, then remove the column.
th_p=30
missing_p= df.isnull().mean()*100
print(missing_p)
remove_c= missing_p[missing_p >= th_p].index
df2= df.drop(columns=remove_c)
print(df2)
OUTPUT:
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 19.865320
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 77.104377
Embarked 0.224467
dtype: float64
PassengerId Survived Pclass ... Ticket Fare Embarked
0 1 0 3 ... A/5 21171 7.2500 S
1 2 1 1 ... PC 17599 71.2833 C
2 3 1 3 ... STON/O2. 3101282 7.9250 S
3 4 1 1 ... 113803 53.1000 S
4 5 0 3 ... 373450 8.0500 S
.. ... ... ... ... ... ... ...
886 887 0 2 ... 211536 13.0000 S
887 888 1 1 ... 112053 30.0000 S
888 889 0 3 ... W./C. 6607 23.4500 S
889 890 1 1 ... 111369 30.0000 C
890 891 0 3 ... 370376 7.7500 Q

[891 rows x 11 columns]

c. If missing values are less than 30% of entire data then create a new data frame
1. Missing values in numeric columns are filled with the mean of the
corresponding column.
2. Missing values in categorical columns are filled with the most frequently
occurring value.

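num = df.select_dtypes(exclude='object').columns
cat = df.select_dtypes(include='object').columns
print("\n",num)
print("\n",cat)

# Assign the result back instead of calling fillna(inplace=True) on a column selection;
# newer pandas versions warn about that pattern and may leave df unchanged
for column in num:
    df[column] = df[column].fillna(df[column].mean())
for column in cat:
    df[column] = df[column].fillna(df[column].mode()[0])
print(df.head())
OUTPUT: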
Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'],
dtype='object')

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')


PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 B96 B98 S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 B96 B98 S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 B96 B98 S

[5 rows x 12 columns]
WEEK:9
11. Write a program to perform the following tasks

a. Determine the categorical columns in Titanic Dataset. Convert Columns
with string data type to numerical data using encoding techniques.

b. Convert data in each numerical column so that it lies in the range [0,1]

import pandas as pd
titanic_dataset=pd.read_csv("titanic_dataset.csv")
df=titanic_dataset
print(df)
OUTPUT:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
.. ... ... ... ... ... ... ...
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q

[891 rows x 12 columns]


a. Determine the categorical columns in Titanic Dataset. Convert Columns
with string data type to numerical data using encoding techniques.
categorical_columns=df.select_dtypes(include='object').columns
print(categorical_columns)
print("\ncategorical columns:",categorical_columns)
# one-hot encode every categorical column (one new 0/1 column per distinct value)
df=pd.get_dummies(df,columns=categorical_columns)
print(df.head())
OUTPUT:
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

categorical columns: Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
PassengerId Survived Pclass ... Embarked_C Embarked_Q Embarked_S
0 1 0 3 ... 0 0 1
1 2 1 1 ... 1 0 0
2 3 1 3 ... 0 0 1
3 4 1 1 ... 0 0 1
4 5 0 3 ... 0 0 1

[5 rows x 1731 columns]


b. Convert data in each numerical column so that it lies in the range [0,1]

from sklearn.preprocessing import MinMaxScaler


numerical_columns=df.select_dtypes(exclude='object').columns
scaler=MinMaxScaler()
df[numerical_columns]=scaler.fit_transform(df[numerical_columns])
print(df.head())

OUTPUT:
PassengerId Survived Pclass ... Embarked_C Embarked_Q Embarked_S
0 0.000000 0.0 1.0 ... 0.0 0.0 1.0
1 0.001124 1.0 0.0 ... 1.0 0.0 0.0
2 0.002247 1.0 1.0 ... 0.0 0.0 1.0
3 0.003371 1.0 0.0 ... 0.0 0.0 1.0
4 0.004494 0.0 1.0 ... 0.0 0.0 1.0

[5 rows x 1731 columns]
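
With its default range of (0, 1), MinMaxScaler maps each column through x' = (x - min) / (max - min). A sketch of the same formula written directly in pandas, on hypothetical sample values, to show where the 0.0 and 1.0 entries in the output come from:

```python
import pandas as pd

# Hypothetical sample values for illustration only
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

col_min = sample.min()   # per-column minimum
col_max = sample.max()   # per-column maximum

# The smallest value of each column maps to 0 and the largest to 1
scaled = (sample - col_min) / (col_max - col_min)
print(scaled)
```

For instance, an Age of 38 becomes (38 - 22) / (54 - 22) = 0.5. A column whose values are all equal would give a zero denominator in this direct formula and needs separate handling.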


WEEK 10
12.Implement the following models on Titanic Dataset and determine
the values of accuracy, precision, recall, f1 score and confusion matrix
for the test data.

a. Logistic Regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix

titanic_df = pd.read_csv("titanic_dataset.csv")

print(titanic_df)
print(titanic_df.info())

# Fill missing ages with the median; assigning back avoids the chained inplace fillna
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())
titanic_df.drop(['PassengerId','Name','Ticket','Cabin','Embarked','Sex'],
axis=1, inplace=True)

print(titanic_df)

X = titanic_df.drop('Survived',axis=1)
y = titanic_df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
random_state = 2003)

#Training the logistic regression model
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

#Making predictions on the test set
y_pred = log_model.predict(X_test)

#Calculating evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:",accuracy)
print("Precision:",precision)
print("recall:",recall)
print("F1 Score:",f1)
print("Confusion Matrix:")
print(conf_matrix)

OUTPUT:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
...
888 889 0 3 ... 23.4500 NaN S
889 890 1 1 ... 30.0000 C148 C
890 891 0 3 ... 7.7500 NaN Q

[891 rows x 12 columns]

<class 'pandas.core.frame.DataFrame'>
...
888 0 3 28.0 1 2 23.4500
889 1 1 26.0 0 0 30.0000
890 0 3 32.0 0 0 7.7500

[891 rows x 6 columns]

Accuracy: 0.7374301675977654
Precision: 0.7906976744186046
recall: 0.4722222222222222
F1 Score: 0.591304347826087
Confusion Matrix:
[[98 9]
[38 34]]
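
The four scores can be re-derived from the confusion matrix alone. scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]] (rows are true labels, columns are predictions), so the sketch below plugs in the counts reported above; the variable names are illustrative.

```python
# Counts read from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 98, 9
fn, tp = 38, 34

total = tn + fp + fn + tp                     # 179 test passengers

accuracy = (tp + tn) / total                  # share of correct predictions
precision = tp / (tp + fp)                    # predicted survivors that really survived
recall = tp / (tp + fn)                       # real survivors that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("recall:", recall)
print("F1 Score:", f1)
```

The low recall reflects the 38 survivors predicted as non-survivors, which is plausible for a model trained without the Sex column.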
WEEK:11
13. Implement the following models on the California House Pricing
Dataset and determine the values of R2 score, the area under roc curve
and root mean squared error for the test set.
a. Linear Regression with Polynomial Features
b. Random Forest Regressor

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures,StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score,mean_squared_error

california_housing=fetch_california_housing()
x=california_housing.data
y=california_housing.target

print(x)
print(y)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
print(x_train,x_test,y_train,y_test)

# fit the scaler on the training split only, then apply it to both splits
scaler=StandardScaler()
x_train_scaled=scaler.fit_transform(x_train)
x_test_scaled=scaler.transform(x_test)

print(x_train_scaled)
print(x_test_scaled)
OUTPUT:
[[ 8.3252 41. 6.98412698 ... 2.55555556 37.88 -122.23 ]
[ 8.3014 21. 6.23813708 ... 2.10984183 37.86 -122.22 ]
[ 7.2574 52. 8.28813559 ... 2.80225989 37.85 -122.24 ]
...
[ 1.7 17. 5.20554273 ... 2.3256351 39.43 -121.22 ]
[ 1.8672 18. 5.32951289 ... 2.12320917 39.43 -121.32 ]
[ 2.3886 16. 5.25471698 ... 2.61698113 39.37 -121.24 ]]
[4.526 3.585 3.521 ... 0.923 0.847 0.894]

a. Linear Regression with Polynomial Features


poly=PolynomialFeatures(degree=2)
x_train_poly=poly.fit_transform(x_train_scaled)
x_test_poly=poly.transform(x_test_scaled)
print(x_train_poly)
print(x_test_poly)
linear_reg=LinearRegression()
linear_reg.fit(x_train_poly,y_train)
y_pred_linear=linear_reg.predict(x_test_poly)
r2_linear=r2_score(y_test,y_pred_linear)
rmse_linear=np.sqrt(mean_squared_error(y_test,y_pred_linear))
print("r2 score:",r2_linear)
print("root mean squared error:",rmse_linear)
OUTPUT:
r2 score: 0.6456819729261787
root mean squared error: 0.6813967448044771
b. Random Forest Regressor
random_forest=RandomForestRegressor(n_estimators=100,random_state=42)
random_forest.fit(x_train_scaled,y_train)
y_pred_rf=random_forest.predict(x_test_scaled)
r2_rf=r2_score(y_test,y_pred_rf)
rmse_rf=np.sqrt(mean_squared_error(y_test,y_pred_rf))
print("R2 score :",r2_rf)
print("root mean squared error:",rmse_rf)

OUTPUT:
R2 score : 0.8049425298843443
root mean squared error: 0.5055739907397444
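
For reference, both scores can be written out with plain NumPy: RMSE is the square root of the mean squared residual, and R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The target and prediction values below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical targets and predictions for illustration only
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

residuals = y_true - y_hat                        # every residual is +0.5 or -0.5

rmse = np.sqrt(np.mean(residuals ** 2))           # root mean squared error

ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot                          # coefficient of determination

print("root mean squared error:", rmse)
print("r2 score:", r2)
```

Here ss_res is 1.0 and ss_tot is 5.25, so R2 is 1 - 1/5.25, roughly 0.81. RMSE is expressed in the units of the target, which for fetch_california_housing is the median house value in units of 100,000 dollars.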
