Python ML Projects
9.1 Regression
This project forecasts temperature using a numerical prediction model with an advanced technique known as bias correction. The dataset used for this project is publicly available at the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Bias+correction+of+numerical+prediction+model+temperature+forecast#
We import the required packages and read the CSV file of the project as follows.
import pandas as pd
import numpy as np

df = pd.read_csv(r'I:/Data science books/temperature.csv')
df.drop('Date', axis=1, inplace=True)  # drop the non-numeric Date column
df.head()
Output: (first five rows of the dataset)
df.describe
Output:
<bound method NDFrame.describe of
      station  Present_Tmax  Present_Tmin  LDAPS_RHmin  LDAPS_RHmax  \
0         1.0          28.7          21.4    58.255688    91.116364
1         2.0          31.9          21.6    52.263397    90.604721
2         3.0          31.6          23.3    48.690479    83.973587
3         4.0          32.0          23.4    58.239788    96.483688
4         5.0          31.4          21.9    56.174095    90.155128
...       ...           ...           ...          ...          ...
7747     23.0          23.3          17.1    26.741310    78.869858
7748     24.0          23.3          17.7    24.040634    77.294975
7749     25.0          23.2          17.4    22.933014    77.243744
7750      NaN          20.0          11.3    19.794666    58.936283
7751      NaN          37.6          29.9    98.524734   100.000153
Note that df.describe without parentheses returns the bound method, whose repr previews the dataframe itself; calling df.describe() would print summary statistics. The NaN entries visible in the station column are handled next.
y = np.asarray(df.Next_Tmax)
X = np.asarray(df.drop('Next_Tmax', axis=1))

# Replace NaN entries with zeros
X = np.nan_to_num(X)
y = np.nan_to_num(y)

# Verify that no NaN values remain
print(np.isnan(X).sum())
print(np.isnan(y).sum())
Output:
0
0
X.shape
Output:
(7752, 23)
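The correlation heatmap discussed below is produced with seaborn; the plotting code itself is not reproduced in the text, so the following is a minimal sketch (the figure size and annotation settings are assumptions).

import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of pairwise feature correlations (figure size assumed)
plt.figure(figsize=(22, 22))
sns.heatmap(df.corr(), annot=True)
plt.show()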
Output: (correlation heatmap of the dataset features)
The light-colored boxes in the aforementioned plot indicate a strong positive correlation between features, while the dark-colored boxes represent strongly negatively correlated features. The purple boxes indicate features that are almost independent of each other.
To apply a linear regression model to the training data, we import the necessary libraries and packages. We also compute and display the mean absolute error and the mean squared error of the result to assess the performance of the method.
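The training code is not reproduced in the text; the following is a minimal sketch consistent with the errors reported below (the split proportion and random_state are assumptions).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hold out 30% of the examples for testing (proportion assumed)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit ordinary least squares regression on the training data
model = LinearRegression()
model.fit(Xtrain, ytrain)
y_pred = model.predict(Xtest)

print('Absolute Error: %.3f' % mean_absolute_error(ytest, y_pred))
print('Mean Squared Error: %.3f' % mean_squared_error(ytest, y_pred))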
Output:
Absolute Error: 1.181
Output:
Mean Squared Error: 2.325
9.2 Classification
This project aims to detect and recognize different English language accents. We use the speaker accent recognition dataset from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Speaker+Accent+Recognition#
This dataset contains single English words read by speakers from six different countries. This is a classification problem because we want to predict one of 6 different accents (classes). We import the required libraries and the dataset as follows.
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# default seaborn settings
sns.set()

df = pd.read_csv(r'I:/Data science books/accent-mfcc.csv')
df.head()
Output: (first five rows of the accent dataset)
df.describe
Output:
<bound method NDFrame.describe of     language        X1        X2        X3         X4        X5         X6  \
0         ES  7.071476 -6.512900  7.650800  11.150783 -7.657312  12.484021
1         ES 10.982967 -5.157445  3.952060  11.529381 -7.638047  12.136098
2         ES  7.827108 -5.477472  7.816257   9.187592 -7.172511  11.715299
3         ES  6.744083 -5.688920  6.546789   9.000183 -6.924963  11.710766
4         ES  5.836843 -5.326557  7.472265   8.847440 -6.773244  12.677218
..       ...        ...       ...       ...        ...       ...        ...
324       US -0.525273 -3.868338  3.548304   1.496249  3.490753   5.849887
325       US -2.094001 -1.073113  1.217397  -0.550790  2.666547   7.449942
326       US  2.116909 -4.441482  5.350392   3.675396  2.715876   3.682670
327       US  0.299616  0.324844  3.299919   2.044040  3.634828   6.693840
328       US  3.214254 -3.135152  1.122691   4.712444  5.926518   6.915566
We find the correlation between features using the following Python script.
plt.figure(figsize=(22, 22))
ax = sns.heatmap(df.corr(), annot=True, annot_kws={"size": 20})

# Enlarge the colorbar tick labels to match the annotations
col_ax = plt.gcf().axes[-1]
col_ax.tick_params(labelsize=20)
plt.show()
Output: (annotated correlation heatmap of the MFCC features)
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

# Encode the six language labels as integers to color the scatter plot
y_encoded = LabelEncoder().fit_transform(df.language)
X = np.asarray(df.drop('language', axis=1))
pca = PCA(n_components=2)
proj = pca.fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y_encoded, cmap='rainbow_r')
plt.colorbar()
plt.show()
Output: (2-D PCA projection of the accent data, colored by language)
Note that we use test_size=0.3, which means that 30% of the examples are assigned to the test set. Here, we have used a random forest classifier, which is an extension of a decision tree classifier. In contrast to a single decision tree, a random forest grows multiple trees on slightly different versions of the same dataset. This allows us to predict the output class using all the grown trees of the random forest: each tree votes for the class it predicts, and we choose the final class by majority vote. We have used 100 estimators, i.e., 100 trees in the random forest classifier. Thus, for each test point, the class that gets the maximum number of votes out of 100 is assigned to that test point. A sketch of the corresponding split and training code follows.
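This code is not shown in the text; the following is a minimal sketch consistent with the variables M, Xtrain, Xtest, ytrain, and ytest used in the prediction step below (the random_state is an assumption).

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# 70/30 split of the MFCC features and encoded language labels
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y_encoded, test_size=0.3, random_state=0)

# Random forest with 100 trees; each tree votes for a class
M = RandomForestClassifier(n_estimators=100)
M.fit(Xtrain, ytrain)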
Now we make predictions, and display the classification report.
y_pred = M.predict(Xtest)
from sklearn.metrics import classification_report
print(classification_report(ytest, y_pred, target_names=df.language.unique()))
Output:
              precision    recall  f1-score   support

    accuracy                           0.74        99
   macro avg       0.69      0.66      0.67        99
weighted avg       0.75      0.74      0.74        99
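The code that produces the confusion matrix below is not shown in the text; the following is a sketch that matches the 'predicted label' axis output (the transpose and styling choices are assumptions).

from sklearn.metrics import confusion_matrix

# Heatmap of the confusion matrix; rows are predicted accents, columns are true accents
mat = confusion_matrix(ytest, y_pred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=df.language.unique(), yticklabels=df.language.unique())
plt.xlabel('true label')
plt.ylabel('predicted label')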
Output:
Text(89.18, 0.5, 'predicted label')
(confusion matrix of the six accent classes)
Next, we recognize faces using the Labeled Faces in the Wild (LFW) dataset. In this dataset, each color picture is centered on a single face, and each pixel of the image is encoded by a float in the range 0.0 to 1.0. We import libraries and download the dataset.
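The download code is not reproduced in the text; the following is a minimal sketch using scikit-learn's LFW loader (the min_faces_per_person threshold is an assumption chosen to keep only frequently photographed people, as in the class list shown later).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people

# Keep only people with many images (threshold assumed); this triggers the download below
faces = fetch_lfw_people(min_faces_per_person=50)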
Output:
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
Downloading LFW data (~200MB): https://ndownloader.figshare.com/files/5976015
When we run this code, the dataset starts to download. It may take some time to download the dataset depending upon the speed of the internet connection. We then start exploring the dataset. To check the number of rows and columns of the dataset, we may type:
faces.data.shape
Output:
(1560, 2914)
There are 1560 images, each having a total of 2914 pixels. To check the shape of an individual image, we may type the following command.
faces.images[0].shape
Output:
(62, 47)
It shows that each image has a pixel grid of 62 rows and 47 columns. We
display the names of the persons whose images are present in the
dataset.
faces.target_names
Output:
array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
'Gerhard Schroeder', 'Hugo Chavez', 'Jacques Chirac',
'Jean Chretien', 'John Ashcroft', 'Junichiro Koizumi',
'Serena Williams', 'Tony Blair'], dtype='<U17')
faces.target_names.size
Output:
12
np.unique(faces.target)
Output:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=int64)
We can also display the name corresponding to a particular target value by typing the following command.
faces.target_names[4]
Output:
'Gerhard Schroeder'
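The code that splits the face data and defines the PCA step is not shown in the text; the following is a minimal sketch (the split parameters are assumptions, while the PCA settings are taken from the fitted GridSearchCV repr printed below).

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Split the flattened face images and targets (split parameters assumed)
Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target, random_state=42)

# 150 whitened principal components, as reported in the fitted grid below
pcaModel = PCA(n_components=150, whiten=True)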
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Support vector machine (SVM) model with radial basis function (rbf) kernel
svmModel = SVC(kernel='rbf', class_weight='balanced')
mdl = make_pipeline(pcaModel, svmModel)
# Parameter grid recovered from the GridSearchCV output printed below
grid = GridSearchCV(mdl, param_grid={'svc__C': [1, 5, 15, 30],
                                     'svc__gamma': [1e-05, 5e-05, 0.0001, 0.005]})
grid.fit(Xtrain, ytrain)
Output:
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=Pipeline(memory=None,
steps=[('pca',
PCA(copy=True, iterated_power='auto',
n_components=150, random_state=None,
svd_solver='auto', tol=0.0,
whiten=True)),
('svc',
SVC(C=1.0, cache_size=200,
class_weight='balanced', coef0=0.0,
decision_function_shape='ovr',
degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1,
probability=False,
random_state=None, shrinking=True,
tol=0.001, verbose=False))],
verbose=False),
iid='warn', n_jobs=None,
param_grid={'svc__C': [1, 5, 15, 30],
'svc__gamma': [1e-05, 5e-05, 0.0001, 0.005]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
print(grid.best_params_)
Output:
{'svc__C': 1, 'svc__gamma': 0.005}
mdl = grid.best_estimator_
y_pred = mdl.predict(Xtest)

fig, ax = plt.subplots(5, 7)
for idx, axidx in enumerate(ax.flat):
    axidx.imshow(Xtest[idx].reshape(62, 47), cmap='bone')
    axidx.set(xticks=[], yticks=[])
    axidx.set_ylabel(faces.target_names[y_pred[idx]].split()[-1],
                     color='green' if y_pred[idx] == ytest[idx] else 'red')
fig.suptitle('Wrong are in red', size=14)
Output: (a 5x7 grid of test faces; the labels of incorrect predictions are shown in red)
To assess the performance of the proposed support vector classifier, we
generate the classification report as follows.
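The report code itself is not reproduced in the text; the following is a minimal sketch using scikit-learn's classification_report.

from sklearn.metrics import classification_report

print(classification_report(ytest, y_pred, target_names=faces.target_names))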
Output: (per-class precision, recall, and f1-score for the twelve face classes)
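The confusion-matrix code is likewise not shown; the following sketch matches the 'predicted label' axis output below (the transpose and styling choices are assumptions).

import seaborn as sns
from sklearn.metrics import confusion_matrix

# Heatmap of the confusion matrix; rows are predicted people, columns are true people
mat = confusion_matrix(ytest, y_pred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names, yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')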
Output:
Text(89.18, 0.5, 'predicted label')
(confusion matrix of the twelve face classes)
The true and predicted labels are shown on the x and y axes of the confusion matrix, respectively. The diagonal entries of the confusion matrix represent correct classification results. It can be observed that most images are correctly classified by the model. However, occasional misclassifications appear in the off-diagonal entries of the matrix. For example, Tony Blair is misclassified as George W Bush 9 times, and Donald Rumsfeld is wrongly predicted as George W Bush 6 times.