1 ASSIGNMENT 01
Predict the price of an Uber ride from a given pickup point to the agreed drop-off location.
Perform the following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores (R2, RMSE, etc.).
[ ]: #importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
df = pd.read_csv("C://Users//91772//Desktop//ML assigns//uber.csv")
[ ]: df.head()
(output: first five rows of the raw dataframe, truncated in the export)
[ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB
[ ]: df.shape
[ ]: (200000, 9)
[ ]: df.isnull().sum()
[ ]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64
[ ]: df.dropna(inplace=True)
[ ]: df.isnull().sum()
[ ]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64
[ ]: #dropping the index and key columns, which carry no predictive information
df.drop(labels='Unnamed: 0', axis=1, inplace=True)
df.drop(labels='key', axis=1, inplace=True)
[ ]: df.head()
[ ]: df.dtypes
[ ]: fare_amount float64
pickup_datetime object
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object
[ ]: df.describe()
fare_amount pickup_longitude pickup_latitude dropoff_longitude \
count 199999.000000 199999.000000 199999.000000 199999.000000
mean 11.359892 -72.527631 39.935881 -72.525292
std 9.901760 11.437815 7.720558 13.117408
min -52.000000 -1340.648410 -74.015515 -3356.666300
25% 6.000000 -73.992065 40.734796 -73.991407
50% 8.500000 -73.981823 40.752592 -73.980093
75% 12.500000 -73.967154 40.767158 -73.963658
max 499.000000 57.418457 1644.421482 1153.572603
dropoff_latitude passenger_count
count 199999.000000 199999.000000
mean 39.923890 1.684543
std 6.794829 1.385995
min -881.985513 0.000000
25% 40.733823 1.000000
50% 40.753042 1.000000
75% 40.768001 2.000000
max 872.697628 208.000000
2. Identify outliers.
OUTLIER: An observation that deviates significantly from the rest of the data.
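As a quick sanity check of the 1.5 × IQR rule before writing the helper function below, the quartiles already printed by df.describe() give the fare_amount bounds directly (a worked example; the numbers are taken from the output above):
[ ]: #worked example of the 1.5*IQR rule on fare_amount
q1, q3 = 6.0, 12.5            #25% and 75% quantiles from describe()
iqr = q3 - q1                 #6.5
lower = q1 - 1.5 * iqr        #-3.75: fares below this would be flagged
upper = q3 + 1.5 * iqr        #22.25: fares above this would be flagged
print(lower, upper)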
[ ]: # data visualization: plotting distribution plots
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
sns.distplot(df['fare_amount'])  #sns.distplot is deprecated; sns.histplot(..., kde=True) is the modern equivalent
[ ]: <AxesSubplot:xlabel='fare_amount', ylabel='Density'>
[ ]: sns.distplot(df['pickup_latitude'])
[ ]: <AxesSubplot:xlabel='pickup_latitude', ylabel='Density'>
[ ]: sns.distplot(df['pickup_longitude'])
[ ]: <AxesSubplot:xlabel='pickup_longitude', ylabel='Density'>
[ ]: sns.distplot(df['dropoff_longitude'])
[ ]: <AxesSubplot:xlabel='dropoff_longitude', ylabel='Density'>
[ ]: sns.distplot(df['dropoff_latitude'])
[ ]: <AxesSubplot:xlabel='dropoff_latitude', ylabel='Density'>
[ ]: #creating a function to identify outliers with the IQR rule
def find_outliers_IQR(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    IQR = q3 - q1
    outliers = df[((df < (q1 - 1.5 * IQR)) | (df > (q3 + 1.5 * IQR)))]
    return outliers
[ ]: #getting outlier details for column "fare_amount" using the above function
outliers = find_outliers_IQR(df["fare_amount"])
print("number of outliers: "+ str(len(outliers)))
print("max outlier value: "+ str(outliers.max()))
print("min outlier value: "+ str(outliers.min()))
outliers
[ ]: 6 24.50
30 25.70
34 39.50
39 29.00
48 56.80
…
199976 49.70
199977 43.50
199982 57.33
199985 24.00
199997 30.90
Name: fare_amount, Length: 17166, dtype: float64
[ ]: #you can also pass two columns as arguments to the function (here "passenger_count" and "fare_amount")
outliers = find_outliers_IQR(df[["passenger_count", "fare_amount"]])
outliers
outliers
[ ]: passenger_count fare_amount
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 5.0 NaN
… … …
199995 NaN NaN
199996 NaN NaN
199997 NaN 30.9
199998 NaN NaN
199999 NaN NaN
[ ]: #upper and lower limits which can be used for capping outliers
#(the code of this cell was lost in the export; mean ± 3*std reproduces the printed values exactly)
upper_limit = df['fare_amount'].mean() + 3 * df['fare_amount'].std()
lower_limit = df['fare_amount'].mean() - 3 * df['fare_amount'].std()
print(upper_limit)
print(lower_limit)
41.06517154774204
-18.3453884488253
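The comment above says the limits "can be used for capping", but the capping step itself never appears in the export. A minimal sketch, assuming np.clip on the same column:
[ ]: #cap (winsorize) fare_amount at the computed limits instead of dropping rows
df['fare_amount'] = np.clip(df['fare_amount'], lower_limit, upper_limit)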
3. Check the correlation.
[ ]: corrMatrix = df.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
[ ]: #splitting column "pickup_datetime" into 5 columns: "day", "hour", "month", "year", "weekday"
import calendar
df['day'] = df['pickup_datetime'].apply(lambda x: x.day)
df['hour'] = df['pickup_datetime'].apply(lambda x: x.hour)
df['month'] = df['pickup_datetime'].apply(lambda x: x.month)
df['year'] = df['pickup_datetime'].apply(lambda x: x.year)
df['weekday'] = df['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekday()])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> df['day']=df['pickup_datetime'].apply(lambda x:x.day)
AttributeError: 'str' object has no attribute 'day'
The apply calls fail because pickup_datetime is still stored as plain strings. Converting the column with pd.to_datetime first (exactly as is done for the test set further below) fixes the error, after which the cell above runs cleanly:
[ ]: df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
[ ]: #encoding weekday names as integers
df.weekday = df.weekday.map({'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6})
[ ]: df.head()
[ ]: df.info()
[ ]: #splitting the data into train and test
x=df.drop("fare_amount", axis=1)
x
y=df["fare_amount"]
[ ]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)
[ ]: x_train.head()
[ ]: x_test.head()
[ ]: y_train.head()
[ ]: y_test.head()
[ ]: print(x_train.shape)
print(x_test.shape)
print(y_test.shape)
print(y_train.shape)
[ ]: #fitting the linear regression model (the cell defining lrmodel was lost in the export)
lrmodel = LinearRegression()
lrmodel.fit(x_train, y_train)
predictedvalues = lrmodel.predict(x_test)
from sklearn.metrics import mean_squared_error
lrmodelrmse = np.sqrt(mean_squared_error(predictedvalues, y_test))
print("RMSE value for Linear regression is", lrmodelrmse)
[ ]: #fitting the random forest regression model (its definition was lost in the export; hyperparameters assumed)
rfrmodel = RandomForestRegressor(n_estimators=100, random_state=101)
rfrmodel.fit(x_train, y_train)
rfrmodel_pred = rfrmodel.predict(x_test)
rfrmodel_rmse = np.sqrt(mean_squared_error(rfrmodel_pred, y_test))
print("RMSE value for Random forest regression is ", rfrmodel_rmse)
[ ]: rfrmodel_pred.shape
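The test dataframe used below is never loaded anywhere in the export; presumably it is a separate hold-out file. A minimal sketch, with the filename being an assumption:
[ ]: #load the hold-out test set (path is hypothetical; the actual cell is missing from the export)
test = pd.read_csv("test.csv")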
[ ]: test.head()
[ ]: test.isnull().sum()
test["pickup_datetime"] = pd.to_datetime(test["pickup_datetime"])
test['day']=test['pickup_datetime'].apply(lambda x:x.day)
test['hour']=test['pickup_datetime'].apply(lambda x:x.hour)
test['month']=test['pickup_datetime'].apply(lambda x:x.month)
test['year']=test['pickup_datetime'].apply(lambda x:x.year)
test['weekday'] = test['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekday()])
test.weekday = test.weekday.map({'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6})
test.head(5)
[ ]: #Prediction!
rfrmodel_pred= rfrmodel.predict(test)
[ ]: df_pred = pd.DataFrame(rfrmodel_pred)
df_pred
df_pred.to_csv('pred.csv')
ps-05-gradient-descent
[5]: import sympy as sym
x = sym.symbols('x')
expr = (x + 3)**2.0
grad = sym.Derivative(expr, x)
print("{}".format(grad.doit()))
grad.doit().subs(x, 2)
2.0*(x + 3)**1.0
[5]: 10.0
[6]: def gradient_descent1(expr, alpha, start, max_iter):
    x_list = list()
    x = sym.symbols('x')
    grad = sym.Derivative(expr, x).doit()
    x_val = start
    x_list.append(x_val)
    for i in range(max_iter):
        gradient = grad.subs(x, x_val)
        x_val = x_val - (alpha * gradient)
        x_list.append(x_val)
    return x_list
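Cells [7]–[8] are missing from the export; the plotting cells below rely on objective, alpha, start, max_iter, and a numeric gradient_descent. A minimal sketch of what they presumably contained (the exact parameter values are assumptions):
[ ]: import numpy as np
from matplotlib import pyplot

def objective(x):
    return (x + 3)**2            #the function being minimized

def derivative(x):
    return 2.0 * (x + 3)         #its analytic gradient

def gradient_descent(alpha, start, max_iter):
    x_list = [start]
    x_val = start
    for i in range(max_iter):
        x_val = x_val - alpha * derivative(x_val)   #plain gradient step
        x_list.append(x_val)
    return x_list

alpha, start, max_iter = 0.1, 2, 30    #assumed values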
[9]: X = gradient_descent(alpha,start,max_iter)
x_cordinate = np.linspace(-5,5,100)
pyplot.plot(x_cordinate,objective(x_cordinate))
X_arr = np.array(X)
pyplot.plot(X_arr, objective(X_arr), '.-', color='red')
pyplot.show()
[10]: X = gradient_descent1(expr, alpha, start, max_iter)
X_arr = np.array(X)
x_cordinate = np.linspace(-5, 5, 100)
pyplot.plot(x_cordinate, objective(x_cordinate))
pyplot.plot(X_arr, objective(X_arr), '.-', color='red')
pyplot.show()
k-nearest-neighbors-algorithm-on-diabetes-csv
1 ASSIGNMENT NO 5
Problem Statement:-Implement K-Nearest Neighbors algorithm on diabetes.csv dataset. Compute
confusion matrix, accuracy, error rate, precision and recall on the given dataset.
[1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score, accuracy_score
[2]: df = pd.read_csv('diabetes.csv')   #cell lost in the export; filename taken from the problem statement
[3]: df.head()
[4]: df.shape
[4]: (768, 9)
[5]: df.describe()
[5]: Pregnancies Glucose BloodPressure SkinThickness Insulin \
count 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479
std 3.369578 31.972618 19.355807 15.952218 115.244002
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000
75% 6.000000 140.250000 80.000000 32.000000 127.250000
max 17.000000 199.000000 122.000000 99.000000 846.000000
[7]: df["Glucose"]
[7]: 0 148.0
1 85.0
2 183.0
3 89.0
4 137.0
…
763 101.0
764 122.0
765 121.0
766 126.0
767 93.0
Name: Glucose, Length: 768, dtype: float64
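Cell [6] is missing from the export; Glucose printing as float64 (it is int64 in the raw file) suggests the biologically impossible zeros were replaced there. A sketch of a common approach, assuming mean imputation on the affected columns (column list is an assumption):
[6]: #replace zeros, which act as missing-value placeholders, with the column mean
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in zero_cols:
    df[col] = df[col].replace(0, df[col].mean())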
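The export also omits cell [8], which must have produced X_train and X_test; a minimal sketch, assuming Outcome is the label and an 80/20 split (random_state is a guess):
[8]: X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)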
[9]: #feature Scaling
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)
[10]: knn=KNeighborsClassifier(n_neighbors=11)
[11]: knn.fit(X_train,y_train)
[11]: KNeighborsClassifier(n_neighbors=11)
[12]: y_pred=knn.predict(X_test)
[15]: tn, fp, fn, tp = confusion_matrix(y_test, y_pred ).ravel()
[17]: accuracy_score(y_test, y_pred)    #code reconstructed; (tp+tn)/total from the confusion matrix gives this value
[17]: 0.8181818181818182
[18]: precision_score(y_test, y_pred)   #code reconstructed; tp/(tp+fp) gives this value
[18]: 0.7111111111111111
[19]: recall_score(y_test, y_pred)      #code reconstructed; tp/(tp+fn) gives this value
[19]: 0.6808510638297872
[20]: #error rate = 1 - accuracy, which lies between 0 and 1
error_rate = 1 - accuracy_score(y_test, y_pred)
[21]: error_rate
[21]: 0.18181818181818177
03-email-classification-using-knn
1 ASSIGNMENT NO 3
Classify the email using the binary classification method. Email Spam detection has two states:
a) Normal State – Not Spam, b) Abnormal State – Spam. Use K-Nearest Neighbors and Support
Vector Machine for classification. Analyze their performance. Dataset link: The emails.csv dataset
on the Kaggle https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv
[1]: #imports (cell reconstructed; the export begins at cell [3])
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
[3]: df = pd.read_csv('emails.csv')
[4]: df.head()
[4]: Email No. the to ect and for of a you hou … connevey jay \
0 Email 1 0 0 1 0 0 0 2 0 0 … 0 0
1 Email 2 8 13 24 6 6 2 102 1 27 … 0 0
2 Email 3 0 0 1 0 0 0 8 0 0 … 0 0
3 Email 4 0 5 22 0 5 1 51 2 10 … 0 0
4 Email 5 7 6 17 1 5 2 57 0 9 … 0 0
[5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB
[6]: df.isnull().sum()
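Cells [7]–[10] are missing from the export; the output below is the repr of a fitted KNeighborsClassifier, which implies a pipeline roughly like this sketch (split parameters assumed):
[7]: X = df.drop(['Email No.', 'Prediction'], axis=1)   #word-count features
y = df['Prediction']                                    #1 = spam, 0 = not spam
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[8]: knn = KNeighborsClassifier()
[10]: knn.fit(X_train, y_train)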
[10]: KNeighborsClassifier()
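The prediction and confusion-matrix cells are likewise missing; the cm displayed in [13] presumably came from something like:
[11]: y_pred = knn.predict(X_test)
[12]: cm = confusion_matrix(y_test, y_pred)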
[13]: cm
[14]: from sklearn.metrics import classification_report
cl_report=classification_report(y_test,y_pred)
print(cl_report)
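The assignment also calls for a Support Vector Machine for comparison, but no SVM cells survive in the export. A minimal sketch using sklearn's SVC (kernel and C are assumptions):
[ ]: from sklearn.svm import SVC
svm = SVC(C=1.0, kernel='rbf')          #hyperparameters assumed
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
print(classification_report(y_test, svm_pred))   #compare against the KNN report above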
clustering-on-sales-datasample-csv
[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
[3]: #the raw file is not valid UTF-8: its byte-order mark renders as "ï»¿" (bytes 0xef, 0xbb, 0xbf) and makes the
#default utf-8 read fail, so an explicit encoding is passed (filename and encoding value are assumptions that
#match the error note in the original notebook)
data = pd.read_csv('sales_data_sample.csv', encoding='latin1')
data.head()
2 27 rue du Colonel Pierre Avia NaN Paris NaN
3 78934 Hillside Dr. NaN Pasadena CA
4 7734 Strong St. NaN San Francisco CA
[5 rows x 25 columns]
[4]: data.shape
[5]: data.isnull().sum()
[5]: ORDERNUMBER 0
QUANTITYORDERED 0
PRICEEACH 0
ORDERLINENUMBER 0
SALES 0
ORDERDATE 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
PHONE 0
ADDRESSLINE1 0
ADDRESSLINE2 2521
CITY 0
STATE 1486
POSTALCODE 76
COUNTRY 0
TERRITORY 1074
CONTACTLASTNAME 0
CONTACTFIRSTNAME 0
DEALSIZE 0
dtype: int64
[6]: #dropping identifier and address columns that do not help clustering
#(the truncated column list is reconstructed from the 13 columns that remain in data.info() below)
data.drop(["ORDERNUMBER", "PRICEEACH", "ORDERDATE", "PHONE", "ADDRESSLINE1", "ADDRESSLINE2", "CITY", "STATE",
           "TERRITORY", "POSTALCODE", "CONTACTLASTNAME", "CONTACTFIRSTNAME"], axis=1, inplace=True)
[7]: data.head()
DEALSIZE
0 Small
1 Small
2 Medium
3 Medium
4 Medium
[8]: data.isnull().sum()
[8]: QUANTITYORDERED 0
ORDERLINENUMBER 0
SALES 0
STATUS 0
QTR_ID 0
MONTH_ID 0
YEAR_ID 0
PRODUCTLINE 0
MSRP 0
PRODUCTCODE 0
CUSTOMERNAME 0
COUNTRY 0
DEALSIZE 0
dtype: int64
1 Exploratory Data Analysis
[9]: data.describe()
[11]: import seaborn as sns
Here we can see that every product category lies within a similar price range, so we will build the clusters around that range.
[13]: data['PRODUCTLINE'].unique()
[15]: data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null float64
3 STATUS 2823 non-null object
4 QTR_ID 2823 non-null int64
5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null object
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null object
10 CUSTOMERNAME 2823 non-null object
11 COUNTRY 2823 non-null object
12 DEALSIZE 2823 non-null object
dtypes: float64(1), int64(6), object(6)
memory usage: 308.8+ KB
[17]: #categorical columns (the cell defining list_cat was lost in the export; this reconstruction matches the
#object dtypes shown above)
list_cat = data.select_dtypes(include=['object']).columns
list_cat
[19]: #dealing with the categorical features: label-encode every object column
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for col in list_cat:
    data[col] = le.fit_transform(data[col])   #loop reconstructed; data.info() below shows these columns as int32
[20]: data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null float64
3 STATUS 2823 non-null int32
4 QTR_ID 2823 non-null int64
5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null int32
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null int32
10 CUSTOMERNAME 2823 non-null int32
11 COUNTRY 2823 non-null int32
12 DEALSIZE 2823 non-null int32
dtypes: float64(1), int32(6), int64(6)
memory usage: 307.1 KB
[22]: data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2823 entries, 0 to 2822
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 QUANTITYORDERED 2823 non-null int64
1 ORDERLINENUMBER 2823 non-null int64
2 SALES 2823 non-null int32
3 STATUS 2823 non-null int32
4 QTR_ID 2823 non-null int64
5 MONTH_ID 2823 non-null int64
6 YEAR_ID 2823 non-null int64
7 PRODUCTLINE 2823 non-null int32
8 MSRP 2823 non-null int64
9 PRODUCTCODE 2823 non-null int32
10 CUSTOMERNAME 2823 non-null int32
11 COUNTRY 2823 non-null int32
12 DEALSIZE 2823 non-null int32
dtypes: int32(7), int64(6)
memory usage: 296.1 KB
[23]: data.describe()
[25]: data.columns
[25]: Index(['QUANTITYORDERED', 'ORDERLINENUMBER', 'SALES', 'STATUS', 'QTR_ID',
'MONTH_ID', 'YEAR_ID', 'PRODUCTLINE', 'MSRP', 'PRODUCTCODE',
'CUSTOMERNAME', 'COUNTRY', 'DEALSIZE'],
dtype='object')
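The cells that build the feature matrix X, run the elbow loop, and fit kmeans are missing from the export; the outputs below (inertia_, n_iter_, cluster_centers_, and the SALES/PRODUCTCODE scatter in cell [32]) imply something like this sketch (the feature choice and random_state are assumptions consistent with that plot):
[26]: from sklearn.cluster import KMeans

X = data[["SALES", "PRODUCTCODE"]]     #assumed: matches the axes and centroid columns plotted below

#elbow method: plot inertia for k = 1..10 and look for the bend
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares')
plt.show()

kmeans = KMeans(n_clusters=4, random_state=42)   #k = 4 per the elbow conclusion below
kmeans.fit(X)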
[27]: kmeans.labels_
[28]: kmeans.inertia_
[28]: 1042223216.6249822
[29]: kmeans.n_iter_
[29]: 24
[30]: kmeans.cluster_centers_
Hence the number of clusters chosen will be 4, according to the elbow method.
[32]: sns.scatterplot(data=X, x="SALES", y="PRODUCTCODE", hue=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],
marker="X", c="r", s=80, label="centroids")
plt.legend()
plt.show()