
CS 611: ARTIFICIAL INTELLIGENCE

B. I. Ya’u, August 26, 2021


Machine Learning in Python
Linear Regression Algorithm: The Dataset
https://data.world/data-society/us-air-pollution-data/workspace/file?filename=uspollution%2Fpollution_us_2000_2016.cs

• This dataset deals with air pollution in the U.S., which has been well documented by the U.S. EPA. It includes four major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone).

• State Code: The code allocated by the US EPA to each state

• County Code: The code of each county in a specific state, allocated by the US EPA

• Site Num: The site number in a specific county, allocated by the US EPA

• Address: Address of the monitoring site

• State: State of the monitoring site

• County: County of the monitoring site

• City: City of the monitoring site

• Date Local: Date of monitoring

• The four pollutants (NO2, O3, SO2 and CO) each have five specific columns. For instance, for NO2:

• NO2 Units: The units in which NO2 is measured

• NO2 Mean: The arithmetic mean of the NO2 concentration within a given day

• NO2 AQI: The calculated air quality index of NO2 within a given day

• NO2 1st Max Value: The maximum value obtained for the NO2 concentration in a given day

• NO2 1st Max Hour: The hour when the maximum NO2 concentration was recorded in a given day
Linear Regression with Python

• We will take the State Code as the input, and our task will be to predict the maximum value obtained for the NO2 concentration in a given day.

• Let us begin by importing all the necessary libraries:
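• The import code on the original slide is not reproduced here; the following is a minimal sketch covering every library call used in this section (pandas, NumPy, Matplotlib and scikit-learn):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics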
Loading the Dataset
• dataset = pd.read_csv('/Users/badamasiimamyau/Desktop/USPollution/Pollution.csv')

• Let us explore the dataset a bit. To see the shape of the dataset, run the following command:

• dataset.shape

• This returns the following:

• (1746661, 29)

• The above output shows that our dataset has 1746661 rows and 29 columns.


Linear Regression with Python

Dataset on a 2-D plot

• We can also plot our dataset on a 2-D plot and see whether there are any correlations in the dataset. Here is the code for that:

• dataset.plot(x='State Code', y='NO2 1st Max Value', style='o')

• plt.title('State Code vs NO2 1st Max Value')

• plt.xlabel('StateCode')

• plt.ylabel('MaxValue')

• plt.show()

• The code returns a scatter plot of NO2 1st Max Value against State Code.

Splitting the Dataset

• We now need to split our dataset into attributes and labels.

• Attributes are the independent variables, while labels are the dependent variables whose values will be predicted.

• We need to predict the Max Value for NO2 based on the State Code. This means that State Code will be our x variable while NO2 1st Max Value will be our y variable.

• X = dataset['State Code'].values.reshape(-1,1)

• y = dataset['NO2 1st Max Value'].values.reshape(-1,1)

• We now need to split our dataset into 80% as the train set and 20% as the test set. Here is the code for this:

• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


Training the Algorithm

• Now that we have split our dataset, it is time to train the algorithm. We create an instance of the LinearRegression class and call the fit() function as shown below:

• regressor = LinearRegression()

• regressor.fit(X_train, y_train) # train the algorithm

• The linear regression model finds the best values for the intercept and slope, which give a line that best fits the data. To see the values of the intercept and slope calculated by the linear regression algorithm for our dataset, run the following code:

• # Retrieve the intercept:

• print(regressor.intercept_)

• # Retrieve the slope:

• print(regressor.coef_)

• This returns the following:

• [26.86742714]

• [[-0.06505895]]

• The output indicates that for every one-unit increase in State Code, the predicted NO2 1st Max Value changes by about -0.065 (a change in concentration units, not a percentage).
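• In other words, using the intercept and slope printed above, the fitted model is approximately:

NO2 1st Max Value ≈ 26.867 − 0.0651 × State Code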


Dataset Description

• We can call the describe() function to give us the statistical details of our dataset:

• dataset.describe()

• This returns summary statistics (count, mean, standard deviation, min, quartiles and max) for each numeric column.


Making Predictions

• The algorithm has been trained, hence we can use it to make predictions.

• We will do this using our test data and see how well it predicts the NO2 1st Max Value.

• The following command will help you make predictions:

• y_pred = regressor.predict(X_test)

• We can now compare the actual output values for X_test with the predicted values by running the following script:

• df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})

• print(df)
Making Predictions Cont.

• To create a bar graph that visualizes the comparison result, the following script can be used:

• df1 = df.head(25)

• df1.plot(kind='bar', figsize=(16,10))

• plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')

• plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')

• plt.show()

• The code returns a bar graph comparing the actual and predicted values for the first 25 test samples.


Making Predictions Cont.

• The bar graph shows a comparison between the actual and the predicted values.

• It shows that the predicted values are very close to the actual ones.
Making Predictions Cont.

• Now, let us create a straight-line plot showing the test data:

• plt.scatter(X_test, y_test, color='gray')

• plt.plot(X_test, y_pred, color='red', linewidth=2)

• plt.show()

• It returns a scatter plot of the test data with the fitted regression line drawn through it.

• From the figure, we get a straight-line graph. What does this tell us about our algorithm?
Evaluating the Algorithm
• To test our algorithm we will use three metrics, namely (their definitions are recalled just after this list):

• The Mean Absolute Error (MAE)

• The Mean Squared Error (MSE) and

• The Root Mean Squared Error (RMSE)
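• For reference, with $y_i$ the actual value, $\hat{y}_i$ the predicted value and $n$ the number of test samples, these three metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$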

• We can print them using the following code:

• print('Mean Absolute Error is:', metrics.mean_absolute_error(y_test, y_pred))

• print('Mean Squared Error is:', metrics.mean_squared_error(y_test, y_pred))

• print('Root Mean Squared Error is:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

• This will return the following:

• Mean Absolute Error is: 13.275983143006018

• Mean Squared Error is: 275.6422481748691

• Root Mean Squared Error is: 16.602477169834298


K-Means Clustering Algorithm
• K-Means is one of the simplest clustering algorithms that we have.

• Clustering falls under the category of unsupervised machine learning algorithms.

• It is often applied when the data is not labelled.

• The goal of the algorithm is to identify clusters or groups within the data.

• The idea behind the clusters is that the objects contained in one cluster are more related to one another than to the objects in the other clusters.

• The similarity is a metric reflecting the strength of the relationship between two data objects.

• Clustering is widely applied in exploratory data mining.

• It has many uses in diverse fields such as pattern recognition, machine learning, information retrieval, image analysis, data compression, bio-informatics and computer graphics.

• The algorithm forms clusters of data based on the similarity between data values.

• You are required to specify the value of K, which is the number of clusters that you expect the algorithm to make from
the data.
K-Means Clustering Algorithm Cont.

• The algorithm first selects a centroid value for every cluster. After that, it performs three steps in an iterative manner (a minimal sketch of these steps follows the list):

• Calculate the Euclidean distance between every data instance and the centroids of all clusters.

• Assign each data instance to the cluster whose centroid is nearest.

• Calculate the new centroid values as the mean of the coordinates of the data instances assigned to the corresponding cluster.
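• To make the three steps concrete, here is a minimal from-scratch sketch in NumPy (illustrative only; the slides themselves use scikit-learn's KMeans, introduced below). The function name kmeans and its parameters are our own choices, not part of any library:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select k random data instances as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: Euclidean distance from every instance to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2: assign each instance to the cluster with the nearest centroid
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned instances
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels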

Scikit-Learn library
Importing Libraries

• Let us begin by importing all the libraries that we will be using in this project:

• The Pandas library will help us read and write to spreadsheets.

• The Numpy library will help us generate random data.

• The Matplotlib library will help us visualize our data when necessary.
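• The import code on the original slide is not shown here; a sketch matching the three libraries just described, using their conventional aliases:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt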
Generating Random Data

• In this example, we will be generating random data and then using it in the code.

• The generation of the data will be done in a two-dimensional space as shown below:

X = -2 * np.random.rand(100,2)

X1 = 1 + 2 * np.random.rand(50,2)

X[50:100, :] = X1

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

plt.show()

• We have generated the data and plotted it in a scatter plot.

• Note that we have generated 100 data points. They have then been divided into two groups, each with 50 data points.
Processing the Data

• The scikit-learn library comes with multiple functions that we can use to process randomly generated data.

• Let us first fit a machine learning model to the data.

• First, we import the KMeans model from the library.

• The parameter n_clusters has been given a value of 2, meaning that the algorithm will create 2 clusters from the data.

• The code will print the K-Means model with its parameters after its execution (see the sketch below):
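• The code image from the original slide is not reproduced; a minimal sketch consistent with the Kmean variable used on the following slides:

from sklearn.cluster import KMeans

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
print(Kmean)  # prints the model and its parameters, e.g. KMeans(n_clusters=2)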
Finding the Cluster Centroids

• Note that our instruction was to create 2 clusters from the dataset.

• Now that they have been created, we need to see their centroids.

• Run the following code:

print(Kmean.cluster_centers_)

• The code will return the following:

[[-0.95636312 -0.89363157]

[ 2.10166026 2.04674506]]

• These are the points on the Cartesian plane for the 2 cluster centroids. Let us display them on a plot using different colors:
Finding the Cluster Centroids Cont.

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

# centroid coordinates taken from Kmean.cluster_centers_ above
plt.scatter(-0.95636312, -0.89363157, s=200, c='g', marker='s')

plt.scatter(2.10166026, 2.04674506, s=200, c='r', marker='s')

plt.show()

• The code plots the data with the centroids for the two clusters shown in unique colors.

• Each centroid lies almost at the center of its cluster.
Testing

• Now that we have created clusters and seen the centroids, we need to assign labels to the clusters.

• Each cluster will be assigned a different label. Let us print the labels:

• print(Kmean.labels_)

• The first cluster has been assigned a label of 0, while the second cluster has been assigned a label of 1.

• The first cluster has 50 data points, hence we have 50 0’s.

• This is also the case with the second cluster, hence we have 50 1’s.
Making Predictions

• Everything about our clusters is now set, hence we can use them to make predictions.

• If we have a particular data point, it is possible for us to know the cluster to which it belongs.

• This is demonstrated in the following code:

• sample_test = np.array([-2.0, -2.0])

• second_test = sample_test.reshape(1, -1)

• print(Kmean.predict(second_test))

• This returns the following:

• [0]

• The above result shows that the data point [-2.0, -2.0] belongs to cluster 0.

• Let us predict the cluster of another data point. We are checking the cluster for the point [1.0, 1.0]:

• sample_test = np.array([1.0, 1.0])

• second_test = sample_test.reshape(1, -1)

• print(Kmean.predict(second_test))

• This will return the following:

• [1]

• From the above output, we can tell that the data point is assigned to the cluster with a label of 1.
End of the Slides
Thank You
