Python ML Projects

Uploaded by

Malik Arslan

Data Science Projects

To give the reader a better understanding, this chapter presents 3 data
science projects. The first project performs weather forecasting. It is a
regression problem because we predict the temperature of the next day
based on measurements of weather variables from the previous days. The
second project deals with recognizing the accents of speakers from
different countries reading English words, whereas the third project builds
a model to recognize human faces. The last 2 projects solve classification
problems because there is a discrete set of output labels in both the accent
and the face recognition tasks. We give details of these projects in the
following sections.

9.1 Regression
This project forecasts the next-day temperature from the outputs of a
numerical prediction model, a task known as bias correction of the model
forecast. The dataset used for this project is publicly available at the
UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Bias+correction+of+numerical+prediction+model+temperature+forecast

We import the required packages and read the csv file of the project as
follows.

import pandas as pd
import numpy as np
df = pd.read_csv(r'I:/Data science books/temperature.csv')
df.drop('Date',axis=1,inplace=True)
df.head()

Output:

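To show features and observations, we may type the following. Note that
df.describe is written here without parentheses, so the output is the
representation of the bound method, which contains the DataFrame itself.
Calling df.describe() would return summary statistics instead.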

df.describe
Output:
<bound method NDFrame.describe of
station Present_Tmax Present_Tmin LDAPS_RHmin LDAPS_RHmax \
0 1.0 28.7 21.4 58.255688 91.116364
1 2.0 31.9 21.6 52.263397 90.604721
2 3.0 31.6 23.3 48.690479 83.973587
3 4.0 32.0 23.4 58.239788 96.483688
4 5.0 31.4 21.9 56.174095 90.155128
... ... ... ... ... ...
7747 23.0 23.3 17.1 26.741310 78.869858
7748 24.0 23.3 17.7 24.040634 77.294975
7749 25.0 23.2 17.4 22.933014 77.243744
7750 NaN 20.0 11.3 19.794666 58.936283
7751 NaN 37.6 29.9 98.524734 100.000153

LDAPS_Tmax_lapse LDAPS_Tmin_lapse LDAPS_WS LDAPS_LH LDAPS_CC1 \

0 28.074101 23.006936 6.818887 69.451805 0.233947
1 29.850689 24.035009 5.691890 51.937448 0.225508
2 30.091292 24.565633 6.138224 20.573050 0.209344
3 29.704629 23.326177 5.650050 65.727144 0.216372
4 29.113934 23.486480 5.735004 107.965535 0.151407
... ... ... ... ... ...
7747 26.352081 18.775678 6.148918 72.058294 0.030034
7748 27.010193 18.733519 6.542819 47.241457 0.035874
7749 27.939516 18.522965 7.289264 9.090034 0.048954
7750 17.624954 14.272646 2.882580 -13.603212 0.000000
7751 38.542255 29.619342 21.857621 213.414006 0.967277

... LDAPS_PPT2 LDAPS_PPT3 LDAPS_PPT4 lat lon DEM \

0 ... 0.000000 0.000000 0.000000 37.6046 126.991 212.3350
1 ... 0.000000 0.000000 0.000000 37.6046 127.032 44.7624
2 ... 0.000000 0.000000 0.000000 37.5776 127.058 33.3068
3 ... 0.000000 0.000000 0.000000 37.6450 127.022 45.7160
4 ... 0.000000 0.000000 0.000000 37.5507 127.135 35.0380
... ... ... ... ... ... ... ...
7747 ... 0.000000 0.000000 0.000000 37.5372 126.891 15.5876
7748 ... 0.000000 0.000000 0.000000 37.5237 126.909 17.2956
7749 ... 0.000000 0.000000 0.000000 37.5237 126.970 19.5844
7750 ... 0.000000 0.000000 0.000000 37.4562 126.826 12.3700
7751 ... 21.621661 15.841235 16.655469 37.6450 127.135 212.3350

Slope Solar radiation Next_Tmax Next_Tmin

0 2.785000 5992.895996 29.1 21.2
1 0.514100 5869.312500 30.5 22.5
2 0.266100 5863.555664 31.1 23.9
3 2.534800 5856.964844 31.7 24.3
4 0.505500 5859.552246 31.2 22.5
... ... ... ... ...
7747 0.155400 4443.313965 28.3 18.1
7748 0.222300 4438.373535 28.6 18.8
7749 0.271300 4451.345215 27.8 17.4
7750 0.098475 4329.520508 17.4 11.3
7751 5.178230 5992.895996 38.9 29.8

[7752 rows x 24 columns]>

To convert a list or a tuple into an array, we use np.asarray(). To replace
NaN values with 0s, we use np.nan_to_num(). Finally, we check the total
number of NaN values by using np.isnan().sum() as follows.

y = np.asarray(df.Next_Tmax)
X = np.asarray(df.drop('Next_Tmax',axis=1))
X = np.nan_to_num(X)
y = np.nan_to_num(y)
print(np.isnan(X).sum())
print(np.isnan(y).sum())

Output:
0
0
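As a small illustration of the three NumPy calls used above, the following sketch applies them to a made-up array (the values are hypothetical and are not taken from the temperature dataset). It also shows, as an assumption about what one might prefer rather than what this project does, how to keep only complete rows instead of filling gaps with zeros, since a zero can be a misleading value for a temperature.

```python
import numpy as np

# Hypothetical toy data: two of the three rows contain a missing value (NaN).
toy = np.asarray([[28.7, 21.4],
                  [np.nan, 21.6],
                  [31.6, np.nan]])

# Count the missing entries before cleaning.
n_missing = int(np.isnan(toy).sum())

# Option used in this project: replace every NaN with 0.
filled = np.nan_to_num(toy)

# Alternative (an assumption, not the project's choice): keep complete rows only.
complete = toy[~np.isnan(toy).any(axis=1)]

print(n_missing)                     # number of NaN entries in the raw data
print(int(np.isnan(filled).sum()))   # NaN entries left after filling
print(complete.shape)                # shape after dropping incomplete rows
```

Which option is better depends on how many rows are affected and on whether 0 is a plausible value for the feature.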

We observe that no NaN values remain. StandardScaler from
sklearn.preprocessing transforms our data such that each feature has a
mean of 0 and a standard deviation of 1. This process is known as
standardization, a common form of feature scaling, and many machine
learning algorithms perform better with it. We standardize our features as
follows.

from sklearn.preprocessing import StandardScaler

s = StandardScaler()
X = s.fit_transform(X)
X.shape

Output:
(7752, 23)
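To make the effect of StandardScaler concrete, the sketch below standardizes a small, made-up feature matrix by hand and compares the result with scikit-learn. The array X_toy and its values are hypothetical; only StandardScaler itself comes from the text above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 5 observations, 2 features on different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0],
                  [5.0, 900.0]])

# Standardization by hand: subtract the column mean and divide by the column
# standard deviation (population form, ddof=0, as StandardScaler does).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation with scikit-learn.
X_scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_scaled))   # both routes agree
print(X_scaled.mean(axis=0).round(6))    # column means
print(X_scaled.std(axis=0).round(6))     # column standard deviations
```

After scaling, both columns are on the same footing even though the raw values differ by two orders of magnitude.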

Before applying a machine learning model, let us determine the strength
of the relationship between different features. We can, for example,
find the correlation between the features as follows.

import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.figure and plt.show

plt.figure(figsize=(22,22))
sns.heatmap(df.corr(), annot=True, annot_kws={"size": 10})
plt.show()

Output:
The light-colored boxes in the plot indicate a strong positive
correlation between features. The dark-colored boxes represent strongly
negatively correlated features. The purple boxes indicate features that
are almost uncorrelated with each other.

To apply a linear regression model to the training data, we import the
necessary libraries and packages. We also compute and display the mean
absolute error and the mean squared error of the predictions to assess the
performance of the method.

from sklearn.model_selection import train_test_split

Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.2)
from sklearn.linear_model import LinearRegression
m = LinearRegression()
m.fit(Xtrain,ytrain)
y_pred = m.predict(Xtest)
print('Absolute Error: %0.3f'%float(np.abs(ytest-y_pred).sum()/
len(y_pred)))

Output:
Absolute Error: 1.181

from sklearn.metrics import mean_squared_error

print('Mean Squared Error: %0.3f'% mean_squared_error(ytest, y_pred))

Output:
Mean Squared Error: 2.325
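The two error measures printed above can be reproduced by hand. The following sketch does so on a few made-up temperatures (y_true and y_hat are hypothetical values, not results from the model) and checks the result against scikit-learn. The root mean squared error is added as an assumption about what a reader may find convenient, because it is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted temperatures.
y_true = np.array([29.1, 30.5, 31.1, 31.7])
y_hat = np.array([28.1, 31.5, 31.1, 29.7])

errors = y_true - y_hat              # residuals of the four predictions

mae = np.abs(errors).mean()          # mean absolute error
mse = (errors ** 2).mean()           # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error

print(round(mae, 3), round(mse, 3), round(rmse, 3))

# Cross-check against the scikit-learn implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
```

Because the MSE squares each residual, a single large miss raises it much more than it raises the mean absolute error.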

A mean absolute error of 1.181 and an MSE of 2.325 may be acceptable
depending upon the problem to be solved. However, these errors indicate
that the output variable does not have a perfect linear relationship with
the input features. If the output variable were almost linearly related to
the features, the error of the linear regression model would be even
smaller than what is reported above.

9.2 Classification
This project aims to recognize different accents of spoken English. We use
the Speaker Accent Recognition dataset from the UCI Machine Learning
Repository:
https://archive.ics.uci.edu/ml/datasets/Speaker+Accent+Recognition

This dataset contains single English words read by speakers from six
different countries. This is a classification problem because we want to
predict one of 6 accents (classes). We import the required libraries and
the dataset as follows.

# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# default seaborn settings
sns.set()
df = pd.read_csv(r'I:/Data science books//accent-mfcc.csv')
df.head()

Output:
df.describe

Output:
<bound method NDFrame.describe of language X1 X2 X3 X4 X5 X6 \
0 ES 7.071476 -6.512900 7.650800 11.150783 -7.657312 12.484021
1 ES 10.982967 -5.157445 3.952060 11.529381 -7.638047 12.136098
2 ES 7.827108 -5.477472 7.816257 9.187592 -7.172511 11.715299
3 ES 6.744083 -5.688920 6.546789 9.000183 -6.924963 11.710766
4 ES 5.836843 -5.326557 7.472265 8.847440 -6.773244 12.677218
.. ... ... ... ... ... ... ...
324 US -0.525273 -3.868338 3.548304 1.496249 3.490753 5.849887
325 US -2.094001 -1.073113 1.217397 -0.550790 2.666547 7.449942
326 US 2.116909 -4.441482 5.350392 3.675396 2.715876 3.682670
327 US 0.299616 0.324844 3.299919 2.044040 3.634828 6.693840
328 US 3.214254 -3.135152 1.122691 4.712444 5.926518 6.915566

X7 X8 X9 X10 X11 X12

0 -11.709772 3.426596 1.462715 -2.812753 0.866538 -5.244274
1 -12.036247 3.491943 0.595441 -4.508811 2.332147 -6.221857
2 -13.847214 4.574075 -1.687559 -7.204041 -0.011847 -6.463144
3 -12.374388 6.169879 -0.544747 -6.019237 1.358559 -6.356441
4 -12.315061 4.416344 0.193500 -3.644812 2.151239 -6.816310
.. ... ... ... ... ... ...
324 -7.747027 9.738836 -11.754543 7.129909 0.209947 -1.946914
325 -6.418064 10.907098 -11.134323 6.728373 2.461446 -0.026113
326 -4.500850 11.798565 -12.031005 7.566142 -0.606010 -2.245129
327 -5.676224 12.000518 -11.912901 4.664406 1.197789 -2.230275
328 -5.799727 10.858532 -11.659845 10.605734 0.349482 -5.983281

[329 rows x 13 columns]>

We find 12 numerical features and 1 categorical output variable describing
the classes.

We find the correlation between features using the following Python script.
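import seaborn as sns
plt.figure(figsize=(22,22))
# drop the string column so that corr() receives numerical features only
ax = sns.heatmap(df.drop('language',axis=1).corr(), annot=True, annot_kws={"size": 20})

# the last axes of the figure is the colorbar; enlarge its tick labels
col_ax = plt.gcf().axes[-1]
col_ax.tick_params(labelsize=20)
plt.show()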

Output:

It can be observed that strong correlations exist between features. Next,
we explore whether the classes overlap. To this end, we apply PCA to the
features to get the first 2 principal components. We also encode the
string output labels as numbers so that the scatter plot can color the 6
classes. We may type the following Python script.

from sklearn.decomposition import PCA

from sklearn import preprocessing

y = np.asarray(df.language)
#creating label Encoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
y_encoded=le.fit_transform(y)

pca = PCA(n_components=2)
X = np.asarray(df.drop('language',axis=1))
proj = pca.fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y_encoded, cmap='rainbow_r')
plt.colorbar()
plt.show()

Output:

The classes are shown in 6 distinct colors. We observe a large overlap
between classes. We extract the target language in variable y and the
input features in the Numpy array X.

# Extraction of target variable and features, and storing them in Numpy
# arrays using asarray()
y = np.asarray(df.language)
X = np.asarray(df.drop('language',axis=1))

# Importing the required libraries and packages

from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest = train_test_split(X,y,test_size=0.3)
from sklearn.ensemble import RandomForestClassifier
# Specifying 100 estimators in a random forest classifier
M = RandomForestClassifier(100)

# Training the RandomForestClassifier, an ensemble tree based classifier.
M.fit(Xtrain,ytrain)
Output:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)

Note that we use test_size=0.3, which means that 30% of the examples are
assigned to the test set. Here, we have used a random forest classifier,
which is an extension of a decision tree classifier. In contrast to a
single decision tree, a random forest grows multiple trees, each on a
bootstrap sample (a random resample) of the training set. This allows us to
predict the output class using all the grown trees of the forest. Each tree
votes for the class it predicts, and we choose the class by majority vote
(scikit-learn implements this vote by averaging the class probabilities
predicted by the trees). We have used 100 estimators, i.e., 100 trees in
the random forest classifier. Thus, for each test point, the class that
gets the maximum number of votes out of 100 is assigned to that test point.

Now we make predictions and display the classification report.

y_pred = M.predict(Xtest)
from sklearn.metrics import classification_report
print(classification_report(ytest,y_pred,target_names=df.language.unique()))

Output:
precision recall f1-score support

ES 0.75 0.67 0.71 9

FR 1.00 0.67 0.80 12
GE 0.50 0.60 0.55 5
IT 0.43 0.60 0.50 5
UK 0.67 0.62 0.65 16
US 0.78 0.83 0.80 52

accuracy 0.74 99
macro avg 0.69 0.66 0.67 99
weighted avg 0.75 0.74 0.74 99

We observe that the random forest classifier reports an accuracy of 74%.
One of the main reasons for not getting an accuracy close to 100% is the
presence of overlapping classes, as observed in the exploratory data
analysis. The accuracy of the model could be improved with features that
separate the classes better.
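The per-class columns of a classification report can be reproduced from raw counts. The sketch below does this for a made-up 3-class example (the label arrays and the helper per_class_scores are hypothetical, not part of the project) and compares the result with scikit-learn. It assumes that every class is predicted at least once, so that no division by zero occurs.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels (not the accent data).
y_true = np.array(['UK', 'UK', 'UK', 'US', 'US', 'US', 'US', 'FR'])
y_hat = np.array(['UK', 'US', 'UK', 'US', 'US', 'UK', 'US', 'FR'])
labels = np.unique(y_true)           # sorted class names

def per_class_scores(y_true, y_hat, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = np.sum((y_hat == label) & (y_true == label))   # correctly predicted
    fp = np.sum((y_hat == label) & (y_true != label))   # wrongly predicted
    fn = np.sum((y_hat != label) & (y_true == label))   # missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

manual = np.array([per_class_scores(y_true, y_hat, lab) for lab in labels])
for lab, (p, r, f) in zip(labels, manual):
    print(lab, round(p, 2), round(r, 2), round(f, 2))

# Cross-check against scikit-learn.
p_sk, r_sk, f_sk, _ = precision_recall_fscore_support(y_true, y_hat, labels=labels)
print(np.allclose(manual, np.column_stack([p_sk, r_sk, f_sk])))
```

Precision answers "how many of the points predicted as this class are right", whereas recall answers "how many points of this class were found".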

To draw the confusion matrix, we type the following commands.

from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest,y_pred)
sns.heatmap(mat.T,square=True,annot=True,fmt='d',cbar=False,
            xticklabels=df.language.unique(),yticklabels=df.language.unique())
plt.xlabel("True label")
plt.ylabel("predicted label")

Output:
Text(89.18, 0.5, 'predicted label')

The entries on the diagonal of the confusion matrix indicate correct
predictions. However, there are some misclassified points as well. For
example, 4 UK accents are wrongly classified as US accents, and 4 US
accents are misclassified as UK accents. Note that we have randomly split
the dataset into training and test sets. When we run the same Python
script again, we may get slightly different results because of the random
assignment of the dataset examples to the training and test sets.

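If repeatable results are wanted, the split can be made deterministic. The following minimal sketch, on made-up data, fixes the shuffle with the random_state argument of train_test_split; the seed value 0 and the toy arrays are arbitrary assumptions, and the project code above does not set a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with one feature and a binary label.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# The same seed gives the same shuffle, hence the same training and test sets.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

same_test_set = np.array_equal(split_a[1], split_b[1])
print(same_test_set)
print(len(split_a[0]), len(split_a[1]))   # sizes of training and test parts
```

RandomForestClassifier accepts a random_state argument as well, so both sources of randomness in this project can be fixed in the same way.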
9.3 Face Recognition
Our third project is on face recognition, which deals with the following
problem: given the picture of a face, find the name of the person among
those present in a training set. For this project, we use the Labeled Faces
in the Wild (LFW) people dataset. This dataset is a collection of JPEG
pictures of famous people collected on the internet; all details of this
dataset are available on the official website:
http://vis-www.cs.umass.edu/lfw/

In this dataset, each picture is centered on a single face. With its
default settings, fetch_lfw_people loads the pictures as grayscale images,
and each pixel is encoded by a float in the range 0.0 - 1.0. We import
libraries and download the dataset.

# Importing libraries and packages

from sklearn.datasets import fetch_lfw_people
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# requires an internet connection to download the data the first time
faces = fetch_lfw_people(min_faces_per_person=50)

Output:
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
Downloading LFW data (~200MB): https://ndownloader.figshare.com/files/5976015

When we run this code, the dataset starts to download. It may take
some time to download the dataset, depending upon the speed of the
internet connection. We start exploring the dataset. To check the number
of rows and columns of the dataset, we may type:

faces.data.shape

Output:
(1560, 2914)

There are 1560 images each having a total of 2914 pixels. To check the
shape of an individual image, we may type the following command.

faces.images[0].shape

Output:
(62, 47)

It shows that each image has a pixel grid of 62 rows and 47 columns. We
display the names of the persons whose images are present in the
dataset.

faces.target_names

Output:
array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
'Gerhard Schroeder', 'Hugo Chavez', 'Jacques Chirac',
'Jean Chretien', 'John Ashcroft', 'Junichiro Koizumi',
'Serena Williams', 'Tony Blair'], dtype='<U17')

faces.target_names.size

Output:
12

np.unique(faces.target)
Output:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=int64)

We can also display the name that corresponds to a given class index by
typing the following command.

faces.target_names[4]
Output:
'Gerhard Schroeder'

To show one image or many images together, we type the following
commands.
plt.imshow(faces.images[0])
Output:
<matplotlib.image.AxesImage at 0x2b9c8a12588>
# Plotting multiple images together
fig , ax = plt.subplots(2,4)
for idx,axidx in enumerate(ax.flat):
    axidx.imshow(faces.images[idx],cmap='bone')
    axidx.set(xticks=[],yticks=[],xlabel=faces.target_names[faces.target[idx]])

Output:

To model our dataset, we import the required machine learning libraries and
packages.

# Importing machine learning support vector classifier (SVC) and PCA

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

Neighboring pixels in an image are highly correlated, so the 2914 raw pixel
values form a redundant, high-dimensional feature vector. Rather than
feeding them directly into a machine learning algorithm, we first transform
our images using principal component analysis.

# Using 150 PCA components to transform the images of the dataset.

# whiten ensures outputs with unit component-wise variances

pcaModel = PCA(n_components=150,whiten=True)

# Support vector machine (SVM) model with radial basis function (rbf) kernel
svmModel = SVC(kernel='rbf',class_weight='balanced')
mdl = make_pipeline(pcaModel,svmModel)

# Splitting our dataset into training and test images

from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest = train_test_split(faces.data,faces.target,test_size=0.2)
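To see what the PCA step of the pipeline above does, the following sketch applies PCA with whiten=True to synthetic data that stands in for the images. The data, its size and the choice of 5 components are assumptions made only for illustration; the project itself uses 150 components on the LFW faces.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for image data: 200 samples, 50 correlated features
# generated from 5 underlying factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

pca = PCA(n_components=5, whiten=True)
Z = pca.fit_transform(X_toy)

print(Z.shape)                                # 50 columns reduced to 5
print(pca.explained_variance_ratio_.sum())    # share of the variance kept
print(Z.std(axis=0, ddof=1))                  # unit variance after whitening

# Off-diagonal correlations between the new components are close to zero.
max_offdiag = np.abs(np.corrcoef(Z, rowvar=False) - np.eye(5)).max()
print(max_offdiag)
```

The transformed columns are uncorrelated and equally scaled, which suits the RBF kernel of the support vector classifier because that kernel depends on distances between samples.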

A support vector classifier uses hyper-parameters whose values affect the
prediction accuracy of the learnt classifier. These values are not learnt
during fitting; they have to be chosen, and good choices give a more
accurate model. In scikit-learn, the hyper-parameters are passed as
arguments to the constructor of the estimator classes. It is possible and
recommended to search the hyper-parameter space for the best cross
validation score.

Grid search is a technique that evaluates every combination of candidate
hyper-parameter values and keeps the combination with the best cross
validation score. Thus, we import and use GridSearchCV. In the pipeline,
the parameters of the SVC step are addressed with the prefix svc__.

from sklearn.model_selection import GridSearchCV

param_grid = {'svc__C':[1,5,15,30],
              'svc__gamma':[0.00001,0.00005,0.0001,0.005]}
grid = GridSearchCV(mdl,param_grid)

grid.fit(Xtrain,ytrain)
Output:
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=Pipeline(memory=None,
steps=[('pca',
PCA(copy=True, iterated_power='auto',
n_components=150, random_state=None,
svd_solver='auto', tol=0.0,
whiten=True)),
('svc',
SVC(C=1.0, cache_size=200,
class_weight='balanced', coef0=0.0,
decision_function_shape='ovr',
degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1,
probability=False,
random_state=None, shrinking=True,
tol=0.001, verbose=False))],
verbose=False),
iid='warn', n_jobs=None,
param_grid={'svc__C': [1, 5, 15, 30],
'svc__gamma': [1e-05, 5e-05, 0.0001, 0.005]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)

print(grid.best_params_)
Output:
{'svc__C': 1, 'svc__gamma': 0.005}
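What GridSearchCV does internally can be approximated with an explicit loop. The sketch below is a simplified illustration under stated assumptions: it uses synthetic data from make_classification instead of the faces, a smaller hypothetical grid, and 5-fold cross validation. It is not the library's actual implementation.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-in data (not the LFW faces): 150 samples, 20 features.
X_toy, y_toy = make_classification(n_samples=150, n_features=20,
                                   n_informative=5, random_state=0)

pipe = make_pipeline(PCA(n_components=5, random_state=0), SVC(kernel='rbf'))
grid_values = {'svc__C': [1, 10], 'svc__gamma': [0.001, 0.01]}

# Manual search: score every combination with 5-fold cross validation
# and remember the best one.
best_score, best_params = -1.0, None
for C, gamma in product(grid_values['svc__C'], grid_values['svc__gamma']):
    params = {'svc__C': C, 'svc__gamma': gamma}
    score = cross_val_score(pipe.set_params(**params), X_toy, y_toy, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

# The same search with GridSearchCV.
search = GridSearchCV(pipe, grid_values, cv=5).fit(X_toy, y_toy)

print(best_params, round(best_score, 3))
print(search.best_params_, round(search.best_score_, 3))
```

When the best value lies on the edge of the grid, as gamma=0.005 does in the output above, it can be worth extending the grid in that direction.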

To make predictions, we type the following Python script.

mdl = grid.best_estimator_
y_pred = mdl.predict(Xtest)
fig,ax = plt.subplots(5,7)
for idx , axidx in enumerate(ax.flat):
    axidx.imshow(Xtest[idx].reshape(62,47),cmap='bone')
    axidx.set(xticks=[],yticks=[])
    # label each face with the predicted surname; green if correct, red if wrong
    axidx.set_ylabel(faces.target_names[y_pred[idx]].split()[-1],
                     color='green' if y_pred[idx]==ytest[idx] else 'red')
fig.suptitle('Wrong are in red',size=14)

Output:
To assess the performance of the proposed support vector classifier, we
generate the classification report as follows.

from sklearn.metrics import classification_report

print(classification_report(ytest,y_pred,target_names=faces.target_names))

Output:

precision recall f1-score support

Ariel Sharon 0.92 0.80 0.86 15

Colin Powell 0.59 0.90 0.71 29
Donald Rumsfeld 0.77 0.96 0.86 25
George W Bush 0.90 0.88 0.89 129
Gerhard Schroeder 0.84 0.88 0.86 24
Hugo Chavez 0.92 0.85 0.88 13
Jacques Chirac 0.83 0.50 0.62 10
Jean Chretien 0.88 0.58 0.70 12
John Ashcroft 1.00 0.90 0.95 10
Junichiro Koizumi 1.00 1.00 1.00 8
Serena Williams 0.80 0.62 0.70 13
Tony Blair 0.76 0.67 0.71 24

accuracy 0.83 312

macro avg 0.85 0.79 0.81 312
weighted avg 0.85 0.83 0.83 312

It is evident from the report that we get an accuracy score of 83%. To
check the performance of the method on individual classes, we plot the
confusion matrix as follows.

# Plotting a confusion matrix

from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest,y_pred)

# heatmap with string format fmt as decimal, colorbar is off

sns.heatmap(mat.T,square=True,annot=True,fmt='d',cbar=False,
            xticklabels=faces.target_names,yticklabels=faces.target_names)
plt.xlabel("True label")
plt.ylabel("predicted label")

Output:
Text(89.18, 0.5, 'predicted label')

The true and predicted labels are shown on the x-axis and the y-axis of the
confusion matrix, respectively. The diagonal entries of the confusion
matrix represent correct classification results. It can be observed that
most images are correctly classified by the model. The occasional
misclassifications appear in the off-diagonal entries of the matrix.
For example, Tony Blair is misclassified as George W Bush 9 times and
Donald Rumsfeld is wrongly predicted as George W Bush 6 times.
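The orientation of the matrix is easy to get wrong, so the following sketch checks it on a made-up example. The label lists are hypothetical; only confusion_matrix and the transpose used for the heatmaps come from the text above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for three classes.
labels = ['Blair', 'Bush', 'Powell']
y_true = ['Blair', 'Blair', 'Bush', 'Bush', 'Bush', 'Powell']
y_hat = ['Bush', 'Blair', 'Bush', 'Bush', 'Powell', 'Powell']

mat = confusion_matrix(y_true, y_hat, labels=labels)

# mat[i, j]   counts samples whose true class is i and predicted class is j.
# mat.T[i, j] counts samples predicted as i whose true class is j, which is
# the layout of the heatmaps in this chapter (rows: predicted, columns: true).
print(mat)
print(mat.T)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = np.trace(mat) / mat.sum()
print(round(accuracy, 3))
```

Here the single Blair image that was predicted as Bush appears at mat[0, 1] and, after transposing, in the Bush row and Blair column of the plotted matrix.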
