
Python ML Projects

Uploaded by

Malik Arslan
Copyright
© © All Rights Reserved

Data Science Projects

To give the reader a better understanding, this chapter presents 3 data
science projects. The first project performs weather forecasting. It is a
regression problem because we predict the temperature of the next day
based on measurements of weather variables from the previous days. The
second project deals with recognizing the accents of speakers from six
different countries reading English words, whereas the third project
builds a model to recognize human faces. The last 2 projects solve
classification problems because there is a discrete set of output labels
in both the accent and the face recognition tasks. We give details of
these projects in the following sections.

9.1 Regression
This project forecasts temperature using a numerical prediction model
with an advanced technique known as bias correction. The dataset used
for this project is publicly available at UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Bias+correction+of+numerical+prediction+model+temperature+forecast#

We import the required packages and read the CSV file of the project as
follows.

import pandas as pd
import numpy as np
df = pd.read_csv(r'I:/Data science books/temperature.csv')
df.drop('Date',axis=1,inplace=True)
df.head()

Output:

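To show features and observations, we may type the following. Note that
describe is written here without parentheses, so the method is not called:
the output below is the representation of the bound method, which lists
the DataFrame itself. Typing df.describe() instead would return summary
statistics for each column.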

df.describe
Output:
<bound method NDFrame.describe of
station Present_Tmax Present_Tmin LDAPS_RHmin LDAPS_RHmax \
0 1.0 28.7 21.4 58.255688 91.116364
1 2.0 31.9 21.6 52.263397 90.604721
2 3.0 31.6 23.3 48.690479 83.973587
3 4.0 32.0 23.4 58.239788 96.483688
4 5.0 31.4 21.9 56.174095 90.155128
... ... ... ... ... ...
7747 23.0 23.3 17.1 26.741310 78.869858
7748 24.0 23.3 17.7 24.040634 77.294975
7749 25.0 23.2 17.4 22.933014 77.243744
7750 NaN 20.0 11.3 19.794666 58.936283
7751 NaN 37.6 29.9 98.524734 100.000153

LDAPS_Tmax_lapse LDAPS_Tmin_lapse LDAPS_WS LDAPS_LH LDAPS_CC1 \
0 28.074101 23.006936 6.818887 69.451805 0.233947
1 29.850689 24.035009 5.691890 51.937448 0.225508
2 30.091292 24.565633 6.138224 20.573050 0.209344
3 29.704629 23.326177 5.650050 65.727144 0.216372
4 29.113934 23.486480 5.735004 107.965535 0.151407
... ... ... ... ... ...
7747 26.352081 18.775678 6.148918 72.058294 0.030034
7748 27.010193 18.733519 6.542819 47.241457 0.035874
7749 27.939516 18.522965 7.289264 9.090034 0.048954
7750 17.624954 14.272646 2.882580 -13.603212 0.000000
7751 38.542255 29.619342 21.857621 213.414006 0.967277

... LDAPS_PPT2 LDAPS_PPT3 LDAPS_PPT4 lat lon DEM \
0 ... 0.000000 0.000000 0.000000 37.6046 126.991 212.3350
1 ... 0.000000 0.000000 0.000000 37.6046 127.032 44.7624
2 ... 0.000000 0.000000 0.000000 37.5776 127.058 33.3068
3 ... 0.000000 0.000000 0.000000 37.6450 127.022 45.7160
4 ... 0.000000 0.000000 0.000000 37.5507 127.135 35.0380
... ... ... ... ... ... ... ...
7747 ... 0.000000 0.000000 0.000000 37.5372 126.891 15.5876
7748 ... 0.000000 0.000000 0.000000 37.5237 126.909 17.2956
7749 ... 0.000000 0.000000 0.000000 37.5237 126.970 19.5844
7750 ... 0.000000 0.000000 0.000000 37.4562 126.826 12.3700
7751 ... 21.621661 15.841235 16.655469 37.6450 127.135 212.3350

Slope Solar radiation Next_Tmax Next_Tmin
0 2.785000 5992.895996 29.1 21.2
1 0.514100 5869.312500 30.5 22.5
2 0.266100 5863.555664 31.1 23.9
3 2.534800 5856.964844 31.7 24.3
4 0.505500 5859.552246 31.2 22.5
... ... ... ... ...
7747 0.155400 4443.313965 28.3 18.1
7748 0.222300 4438.373535 28.6 18.8
7749 0.271300 4451.345215 27.8 17.4
7750 0.098475 4329.520508 17.4 11.3
7751 5.178230 5992.895996 38.9 29.8

[7752 rows x 24 columns]>
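
The listing above is the representation of the bound method, because describe was referenced without being called. As a minimal sketch on a small made-up frame (the values below are illustrative assumptions, not rows of the temperature dataset), the following shows how the shape, the summary statistics, and the per-column NaN counts might be inspected.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Present_Tmax": [28.7, 31.9, np.nan, 32.0],
    "Present_Tmin": [21.4, 21.6, 23.3, 23.4],
})

print(toy.shape)                # (rows, columns)
summary = toy.describe()        # note the parentheses: this calls the method
print(summary)
nan_counts = toy.isna().sum()   # number of NaN values in each column
print(nan_counts)
```

The count row of the summary excludes NaN values, so it gives a quick hint about which columns have missing entries.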

To convert a list or a tuple into an array, we use np.asarray(). To replace
NaN values with 0s, we use np.nan_to_num(). Finally, we check the total
number of NaN values in X and in y by using np.isnan().sum() as follows.

y = np.asarray(df.Next_Tmax)
X = np.asarray(df.drop('Next_Tmax',axis=1))
X = np.nan_to_num(X)
y = np.nan_to_num(y)
print(np.isnan(X).sum())
print(np.isnan(y).sum())

Output:
0
0

We observe that NaN values have been removed. StandardScaler from
sklearn.preprocessing transforms our data such that each feature has a
mean value of 0 and a standard deviation of 1. This process is known as
standardization (often loosely called normalization) of the features, and
it helps many machine learning algorithms to perform better. We
standardize our features as follows.

from sklearn.preprocessing import StandardScaler


s = StandardScaler()
X = s.fit_transform(X)
X.shape

Output:
(7752, 23)
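
To make the effect of StandardScaler concrete, here is a minimal sketch on a small made-up matrix (X_toy and its values are assumptions for illustration). It compares the scaler's output with the same transformation written out by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with columns on very different scales (made-up values)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Same result by hand: subtract the column mean, divide by the column std
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

A common refinement, not used in this chapter, is to fit the scaler on the training split only and reuse it to transform the test split, so that no test-set statistics reach the model.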

Before applying a machine learning model, let us determine the strength
of the relationship between different features. We can, for example,
find the pairwise correlation between the features as follows.

import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(22,22))
sns.heatmap(df.corr(), annot=True, annot_kws={"size": 10})
plt.show()

Output:
The light-colored boxes in the aforementioned plot indicate a strong
positive correlation between features. The dark-colored boxes represent
strongly negatively correlated features. The purple color indicates
features that are almost uncorrelated with each other.
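
As a small illustration of what the heatmap values mean, the sketch below builds a made-up frame (columns a to d are hypothetical) with one strongly positive, one strongly negative, and one unrelated pair. It then checks one entry of the correlation matrix against the definition of the Pearson coefficient.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),  # strongly positively related to a
    "c": -a + rng.normal(scale=0.1, size=200),       # strongly negatively related to a
    "d": rng.normal(size=200),                       # unrelated noise
})

corr = toy.corr()   # Pearson correlation by default
print(corr.round(2))

# The same coefficient for one pair, computed from its definition:
# covariance divided by the product of the standard deviations
cov_ab = ((toy["a"] - toy["a"].mean()) * (toy["b"] - toy["b"].mean())).mean()
r_ab = cov_ab / (toy["a"].std(ddof=0) * toy["b"].std(ddof=0))
print(r_ab)
```

A coefficient near zero only rules out a linear relationship; two features can still depend on each other in a nonlinear way.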
To apply a linear regression model to the training data, we import the
necessary libraries and packages. We also compute and display the mean
absolute error of the predictions to assess the performance of the method.

from sklearn.model_selection import train_test_split


Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.2)
from sklearn.linear_model import LinearRegression
m = LinearRegression()
m.fit(Xtrain,ytrain)
y_pred = m.predict(Xtest)
print('Absolute Error: %0.3f'%float(np.abs(ytest-y_pred).sum()/
len(y_pred)))

Output:
Absolute Error: 1.181

from sklearn.metrics import mean_squared_error


print('Mean Squared Error: %0.3f'% mean_squared_error(ytest, y_pred))

Output:
Mean Squared Error: 2.325

A mean absolute error of 1.181 and an MSE of 2.325 may be acceptable
depending upon the problem to be solved. However, these errors indicate
that the output variable does not have a perfect linear relationship with
the input features. If the output variable were almost linearly related to
the features, the error of the linear regression model would be even lower
than what is reported above.
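
The two error measures used above can be written out directly. The following is a minimal sketch on made-up values (y_true and y_hat are assumptions, not outputs of the model above) that computes them by hand and compares them with the scikit-learn helpers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted temperatures
y_true = np.array([29.1, 30.5, 31.1, 31.7])
y_hat = np.array([28.1, 31.0, 31.1, 33.7])

mae = np.abs(y_true - y_hat).mean()    # mean absolute error
mse = ((y_true - y_hat) ** 2).mean()   # mean squared error
rmse = np.sqrt(mse)                    # root of the MSE, in the units of the target

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(rmse)
```

Because the MSE squares each error, it penalizes large misses more heavily than the mean absolute error does. For the values reported above, the square root of the MSE is about 1.52, somewhat larger than the mean absolute error of 1.181.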

9.2 Classification
This project aims to detect and recognize different English language
accents. We use the speaker accent recognition dataset from UCI
Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/Speaker+Accent+Recognition#

This dataset contains single English words read by speakers from six
different countries. This is a classification problem because we want to
predict one of 6 different accents (classes). We import the required
libraries and the dataset as follows.

# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# default seaborn settings
sns.set()
df = pd.read_csv(r'I:/Data science books//accent-mfcc.csv')
df.head()

Output:
df.describe

Output:
<bound method NDFrame.describe of language X1 X2 X3 X4 X5 X6 \
0 ES 7.071476 -6.512900 7.650800 11.150783 -7.657312 12.484021
1 ES 10.982967 -5.157445 3.952060 11.529381 -7.638047 12.136098
2 ES 7.827108 -5.477472 7.816257 9.187592 -7.172511 11.715299
3 ES 6.744083 -5.688920 6.546789 9.000183 -6.924963 11.710766
4 ES 5.836843 -5.326557 7.472265 8.847440 -6.773244 12.677218
.. ... ... ... ... ... ... ...
324 US -0.525273 -3.868338 3.548304 1.496249 3.490753 5.849887
325 US -2.094001 -1.073113 1.217397 -0.550790 2.666547 7.449942
326 US 2.116909 -4.441482 5.350392 3.675396 2.715876 3.682670
327 US 0.299616 0.324844 3.299919 2.044040 3.634828 6.693840
328 US 3.214254 -3.135152 1.122691 4.712444 5.926518 6.915566

X7 X8 X9 X10 X11 X12
0 -11.709772 3.426596 1.462715 -2.812753 0.866538 -5.244274
1 -12.036247 3.491943 0.595441 -4.508811 2.332147 -6.221857
2 -13.847214 4.574075 -1.687559 -7.204041 -0.011847 -6.463144
3 -12.374388 6.169879 -0.544747 -6.019237 1.358559 -6.356441
4 -12.315061 4.416344 0.193500 -3.644812 2.151239 -6.816310
.. ... ... ... ... ... ...
324 -7.747027 9.738836 -11.754543 7.129909 0.209947 -1.946914
325 -6.418064 10.907098 -11.134323 6.728373 2.461446 -0.026113
326 -4.500850 11.798565 -12.031005 7.566142 -0.606010 -2.245129
327 -5.676224 12.000518 -11.912901 4.664406 1.197789 -2.230275
328 -5.799727 10.858532 -11.659845 10.605734 0.349482 -5.983281

[329 rows x 13 columns]>

We find 12 numerical features (X1 to X12) and 1 categorical output
variable, language, describing the classes.

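We find the correlation between features using the following Python script.

import seaborn as sns
plt.figure(figsize=(22,22))
# drop the string-valued 'language' column: correlation needs numeric data
ax = sns.heatmap(df.drop('language',axis=1).corr(), annot=True, annot_kws={"size": 20})
# the colorbar is the last axes of the figure; enlarge its tick labels
col_ax = plt.gcf().axes[-1]
col_ax.tick_params(labelsize=20)
plt.show()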

Output:

It can be observed that a strong correlation exists between some of the
features. Next, we explore whether the classes overlap. To this end, we
apply PCA to the features to get the first 2 principal components. We
also encode the string output labels as numbers so that the scatter plot
can show all 6 classes in different colors. We may type the following
Python script.

from sklearn.decomposition import PCA

from sklearn import preprocessing


y = np.asarray(df.language)
#creating label Encoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
y_encoded=le.fit_transform(y)

pca = PCA(n_components=2)
X = np.asarray(df.drop('language',axis=1))
proj = pca.fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y_encoded, cmap='rainbow_r')
plt.colorbar()
plt.show()

Output:

The classes are shown in 6 distinct colors. We observe a big overlap
between the classes. We extract the target language in the variable y and
the input features in the NumPy array X.

# Extraction of target variable and features, and storing them in Numpy
# arrays using asarray()
y = np.asarray(df.language)
X = np.asarray(df.drop('language',axis=1))

# Importing the required libraries and packages


from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.3)
from sklearn.ensemble import RandomForestClassifier
# Specifying 100 estimators in a random forest classifier
M = RandomForestClassifier(100)

# Training the RandomForestClassifier, an ensemble tree-based classifier.
M.fit(Xtrain,ytrain)
Output:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)

Note that we use test_size=0.3, which means that 30% of the examples are
assigned to the test set. Here, we have used a random forest classifier,
which is an extension of a decision tree classifier. In contrast to a
single decision tree, a random forest grows multiple trees, each on a
slightly different version (a bootstrap sample) of the same dataset. This
allows us to predict the output class using all the grown trees of the
random forest. Each tree votes for the class it predicts, and we choose
the class by majority vote. We have used 100 estimators, i.e., 100 trees
in the random forest classifier. Thus, for each test point, the class that
gets the maximum number of the 100 votes is assigned to that test point.
(In scikit-learn, the trees' predicted class probabilities are averaged
instead of counting hard votes; for fully grown trees this usually gives
the same result.)
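
The voting idea can be checked directly on a fitted forest. The sketch below is only an illustration under stated assumptions: it uses synthetic two-class data from make_classification instead of the accent dataset, and a smaller forest of 25 trees. It collects the prediction of every tree, takes the majority vote by hand, and compares it with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data standing in for the accent features
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree predicts an index into forest.classes_ for every sample
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_]).astype(int)

# Hard majority vote: count the votes per class for each sample (one column each)
votes = np.apply_along_axis(np.bincount, 0, tree_preds, minlength=len(forest.classes_))
majority = forest.classes_[votes.argmax(axis=0)]

# Fraction of samples where the hand-made vote matches forest.predict
agreement = (majority == forest.predict(X_demo)).mean()
print(tree_preds.shape, agreement)
```

An odd number of trees is used here so that a two-class vote cannot end in a tie.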
Now we make predictions, and display the classification report.

y_pred = M.predict(Xtest)
from sklearn.metrics import classification_report
print(classification_report(ytest,y_pred,target_names=df.language.unique()))

Output:
              precision    recall  f1-score   support

          ES       0.75      0.67      0.71         9
          FR       1.00      0.67      0.80        12
          GE       0.50      0.60      0.55         5
          IT       0.43      0.60      0.50         5
          UK       0.67      0.62      0.65        16
          US       0.78      0.83      0.80        52

    accuracy                           0.74        99
   macro avg       0.69      0.66      0.67        99
weighted avg       0.75      0.74      0.74        99

We observe that the random forest classifier reports an accuracy of 74%.
One of the main reasons for not getting an accuracy close to 100% is the
presence of overlapping classes, as observed in the exploratory data
analysis. The accuracy of the model can be improved if we separate the
classes as much as possible.
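
The last two rows of the classification report are averages of the per-class rows. As a short sketch, the per-class precision and support values can be copied from the report above and the two averages recomputed (the variable names are illustrative).

```python
import numpy as np

# Per-class precision and support copied from the classification report above
# (order: ES, FR, GE, IT, UK, US)
precision = np.array([0.75, 1.00, 0.50, 0.43, 0.67, 0.78])
support = np.array([9, 12, 5, 5, 16, 52])

macro_avg = precision.mean()                           # every class counts equally
weighted_avg = np.average(precision, weights=support)  # classes weighted by support

print(round(macro_avg, 2), round(weighted_avg, 2))
```

The US class contributes 52 of the 99 test examples, so it dominates the weighted average. The macro average gives the small GE and IT classes the same weight as the US class, which is why it is lower.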

To draw the confusion matrix, we type the following commands.


from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest,y_pred)
sns.heatmap(mat.T,square=True,annot=True,fmt='d',cbar=False,
            xticklabels=df.language.unique(),yticklabels=df.language.unique())
plt.xlabel("True label")
plt.ylabel("predicted label")

Output:
Text(89.18, 0.5, 'predicted label')

The entries on the diagonal of the confusion matrix indicate correct
predictions. However, there are some misclassified points as well. For
example, 4 UK accents are wrongly classified as US accents, and 4 US
accents are misclassified as UK accents. Note that we have randomly split
the dataset into training and test sets. When we run the same Python
script again, we may get slightly different results because of the random
assignment of the dataset examples to the training and test sets.
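
If repeatable results are wanted, the split can be seeded. The following is a minimal sketch on made-up data (X_demo, y_demo and the seed 42 are assumptions) showing that a fixed random_state reproduces the same split, and that stratify keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 examples, 3 features, 2 balanced classes
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.array([0, 1] * 10)

# Same seed, same split on every run; stratify preserves the class balance
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

RandomForestClassifier also accepts a random_state argument, so seeding both the split and the classifier makes the whole experiment repeatable.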

9.3 Face Recognition
Our third project is on face recognition, which deals with the following
problem: given the picture of a face, find the name of the person among
those present in a training set. For this project, we use the Labeled
Faces in the Wild (LFW) people dataset. This dataset is a collection of
JPEG pictures of famous people collected on the internet; all details of
this dataset are available on the official website:
http://vis-www.cs.umass.edu/lfw/

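In this dataset, each picture is centered on a single face. With the
default arguments used below, fetch_lfw_people averages the color channels
into a single gray level, so each image is a 2-D grid of pixels. Each
pixel is encoded by a float in the range 0.0 - 1.0. We import libraries
and download the dataset.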

# Importing libraries and packages


from sklearn.datasets import fetch_lfw_people
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# requires internet connection to download data for the first time
faces = fetch_lfw_people(min_faces_per_person=50)

Output:
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
Downloading LFW data (~200MB): https://ndownloader.figshare.com/files/5976015

When we run this code, the dataset starts to download. It may take some
time to download the dataset, depending upon the speed of the internet
connection. We then start exploring the dataset. To check the number of
rows and columns of the dataset, we may type:

faces.data.shape

Output:
(1560, 2914)

There are 1560 images each having a total of 2914 pixels. To check the
shape of an individual image, we may type the following command.

faces.images[0].shape

Output:
(62, 47)

It shows that each image has a pixel grid of 62 rows and 47 columns, which
gives the 62 x 47 = 2914 pixels stored in each row of faces.data. We
display the names of the persons whose images are present in the dataset.

faces.target_names

Output:
array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
'Gerhard Schroeder', 'Hugo Chavez', 'Jacques Chirac',
'Jean Chretien', 'John Ashcroft', 'Junichiro Koizumi',
'Serena Williams', 'Tony Blair'], dtype='<U17')

faces.target_names.size

Output:
12

np.unique(faces.target)
Output:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=int64)

We can also display the name of the person that corresponds to a given
class index by typing the following command.

faces.target_names[4]
Output:
'Gerhard Schroeder'
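
The integer targets and the array of names work together through NumPy indexing. The sketch below uses three of the names listed above and a made-up target vector (both arrays are stand-ins for faces.target_names and faces.target, so nothing needs to be downloaded) to map every image to a name and to count the images per person.

```python
import numpy as np

# Three of the names listed above, and a made-up vector of class indices
target_names = np.array(['Ariel Sharon', 'Colin Powell', 'Tony Blair'])
target = np.array([2, 0, 1, 1, 2, 1])

names_per_image = target_names[target]   # fancy indexing: one name per image
counts = np.bincount(target, minlength=len(target_names))  # images per person

for name, n in zip(target_names, counts):
    print(name, n)
```

On the real dataset the same two lines, applied to faces.target_names and faces.target, would show how unevenly the images are spread over the 12 people.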

To show one image or many images together, we type the following
commands.
plt.imshow(faces.images[0])
Output:
<matplotlib.image.AxesImage at 0x2b9c8a12588>
# Plotting multiple images together
fig, ax = plt.subplots(2,4)
for idx,axidx in enumerate(ax.flat):
    axidx.imshow(faces.images[idx],cmap='bone')
    axidx.set(xticks=[],yticks=[],xlabel=faces.target_names[faces.target[idx]])

Output:

To model our dataset, we import the required machine learning libraries
and packages.

# Importing machine learning support vector classifier (SVC) and PCA


from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
Since neighboring pixels in any image are highly correlated, the 2914 raw
pixel values are a redundant, high-dimensional input for a machine
learning algorithm. We therefore transform our images using principal
component analysis before classification.

# Using 150 PCA components to transform the images of the dataset.

# whiten ensures outputs with unit component-wise variances


pcaModel = PCA(n_components=150,whiten=True)

# Support vector machine (SVM) model with radial basis function (rbf) kernel
svmModel = SVC(kernel='rbf',class_weight='balanced')
mdl = make_pipeline(pcaModel,svmModel)

# Splitting our dataset into training and test images


from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest = train_test_split(faces.data,faces.target,test_size=0.2)
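
To illustrate what whiten=True does, here is a minimal sketch on made-up correlated data (X_demo and the choice of 3 components are assumptions, and svd_solver='full' is set only to keep the result exact). After the transform, every retained component has approximately unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up correlated data: 500 samples, 10 features built from 3 latent factors
latent = rng.normal(size=(500, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA(n_components=3, whiten=True, svd_solver='full')
Z = pca.fit_transform(X_demo)

print(Z.shape)                               # (500, 3)
print(Z.std(axis=0, ddof=1))                 # approximately 1 per component
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```

Unit-variance inputs are convenient for an RBF-kernel SVM, because a single gamma value then applies the same length scale to every component.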

A support vector classifier uses hyper-parameters whose values affect the
prediction accuracy of the learnt classifier. These hyper-parameters are
not learned during training; suitable values have to be found and used
for better accuracy of the model. In scikit-learn, they are passed as
arguments to the constructor of the estimator classes. It is possible and
recommended to search the hyper-parameter space for the best
cross-validation score.

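Grid search is a technique that is used to estimate the optimal values of
the hyper-parameters: every combination of the candidate values is tried
and scored by cross-validation. Thus, we import and use GridSearchCV to
find the combination with the best cross-validation score.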

from sklearn.model_selection import GridSearchCV


param_grid = {'svc__C':[1,5,15,30],'svc__gamma':[0.00001,0.00005,0.0001,0.005]}
grid = GridSearchCV(mdl,param_grid)

grid.fit(Xtrain,ytrain)
Output:
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=Pipeline(memory=None,
steps=[('pca',
PCA(copy=True, iterated_power='auto',
n_components=150, random_state=None,
svd_solver='auto', tol=0.0,
whiten=True)),
('svc',
SVC(C=1.0, cache_size=200,
class_weight='balanced', coef0=0.0,
decision_function_shape='ovr',
degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1,
probability=False,
random_state=None, shrinking=True,
tol=0.001, verbose=False))],
verbose=False),
iid='warn', n_jobs=None,
param_grid={'svc__C': [1, 5, 15, 30],
'svc__gamma': [1e-05, 5e-05, 0.0001, 0.005]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)

print(grid.best_params_)
Output:
{'svc__C': 1, 'svc__gamma': 0.005}
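
The keys svc__C and svc__gamma follow the pipeline naming convention of step name, two underscores, then parameter name. As a small self-contained sketch of the same pattern, the example below runs a grid search on the Iris dataset bundled with scikit-learn (the dataset, the scaler step, and the grid values are assumptions chosen so that the example runs quickly without a download).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# make_pipeline names each step after its lowercased class name:
# 'standardscaler' and 'svc'
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# '<step name>__<parameter>' addresses a parameter of one pipeline step
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))   # mean cross-validated accuracy of the best setting
```

In the face recognition search above, the selected values C=1 and gamma=0.005 both lie on the edge of their grids, which suggests that extending the grid beyond those edges might be worth trying.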

To make predictions, we type the following Python script.

mdl = grid.best_estimator_
y_pred = mdl.predict(Xtest)
fig,ax = plt.subplots(5,7)
for idx , axidx in enumerate(ax.flat):
    axidx.imshow(Xtest[idx].reshape(62,47),cmap='bone')
    axidx.set(xticks=[],yticks=[])
    # label each image with the predicted surname: green if correct, red if wrong
    axidx.set_ylabel(faces.target_names[y_pred[idx]].split()[-1],
                     color='green' if y_pred[idx]==ytest[idx] else 'red')
fig.suptitle('Wrong are in red',size=14)

Output:
To assess the performance of the proposed support vector classifier, we
generate the classification report as follows.

from sklearn.metrics import classification_report


print(classification_report(ytest,y_pred,target_names=faces.target_names))

Output:

                   precision    recall  f1-score   support

     Ariel Sharon       0.92      0.80      0.86        15
     Colin Powell       0.59      0.90      0.71        29
  Donald Rumsfeld       0.77      0.96      0.86        25
    George W Bush       0.90      0.88      0.89       129
Gerhard Schroeder       0.84      0.88      0.86        24
      Hugo Chavez       0.92      0.85      0.88        13
   Jacques Chirac       0.83      0.50      0.62        10
    Jean Chretien       0.88      0.58      0.70        12
    John Ashcroft       1.00      0.90      0.95        10
Junichiro Koizumi       1.00      1.00      1.00         8
  Serena Williams       0.80      0.62      0.70        13
       Tony Blair       0.76      0.67      0.71        24

         accuracy                           0.83       312
        macro avg       0.85      0.79      0.81       312
     weighted avg       0.85      0.83      0.83       312

It is evident from the report that we get an accuracy score of 83%. To
check the performance of the method on individual classes, we plot the
confusion matrix as follows.

# Plotting a confusion matrix


from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest,y_pred)

# heatmap with string format fmt as decimal, colorbar is off


sns.heatmap(mat.T,square=True,annot=True,fmt='d',cbar=False,
            xticklabels=faces.target_names,yticklabels=faces.target_names)
plt.xlabel("True label")
plt.ylabel("predicted label")

Output:
Text(89.18, 0.5, 'predicted label')

The true and predicted labels are shown on the x and y axes of the
confusion matrix, respectively. The diagonal entries of the confusion
matrix represent correct classification results. It can be observed that
most images are correctly classified by the model. However, occasional
misclassified results are shown in the off-diagonal entries of the matrix.
For example, Tony Blair is misclassified as George W Bush 9 times and
Donald Rumsfeld is wrongly predicted as George W Bush 6 times.
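
Reading the largest errors off a plot works for a dozen classes, but the same information can be extracted programmatically. The sketch below uses made-up labels for three hypothetical classes A, B and C. It derives the per-class recall and precision from the confusion matrix and locates the most frequent misclassification.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three hypothetical classes
labels = ['A', 'B', 'C']
y_true = ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']
y_hat = ['A', 'A', 'B', 'B', 'B', 'A', 'A', 'C', 'C', 'B']

# Rows are true labels, columns are predicted labels
# (the heatmap above plots the transpose)
cm = confusion_matrix(y_true, y_hat, labels=labels)

recall = cm.diagonal() / cm.sum(axis=1)      # share of each true class that was found
precision = cm.diagonal() / cm.sum(axis=0)   # share of each predicted class that is right

off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)                # keep only the misclassifications
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"most frequent error: true {labels[i]} predicted as {labels[j]} ({off_diag[i, j]} times)")
```

Here the most frequent error is a true B predicted as A, which lowers both the recall of B and the precision of A.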
