
CS 611: ARTIFICIAL INTELLIGENCE

B. I. Ya’u, August 26, 2021


Machine Learning in Python
Linear Regression Algorithm: The Dataset
https://data.world/data-society/us-air-pollution-data/workspace/file?filename=uspollution%2Fpollution_us_2000_2016.cs

• This dataset deals with air pollution in the U.S., which has been well documented by the U.S. EPA. It includes four major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone).

• State Code: The code allocated by the US EPA to each state

• County Code: The code of each county in a specific state, allocated by the US EPA

• Site Num: The site number in a specific county, allocated by the US EPA

• Address: Address of the monitoring site

• State: State of the monitoring site

• County: County of the monitoring site

• City: City of the monitoring site

• Date Local: Date of monitoring

• The four pollutants (NO2, O3, SO2 and CO) each have five specific columns. For instance, for NO2:

• NO2 Units: The units in which NO2 is measured

• NO2 Mean: The arithmetic mean of the NO2 concentration within a given day

• NO2 AQI: The calculated air quality index of NO2 within a given day

• NO2 1st Max Value: The maximum value obtained for the NO2 concentration in a given day

• NO2 1st Max Hour: The hour when the maximum NO2 concentration was recorded in a given day
Linear Regression with Python

• We will take the State Code as the input, and our task will be to predict the maximum value obtained for the NO2 concentration in a given day.

• Let us begin by importing all the necessary libraries:
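• The import code on the original slide is not reproduced here; the following is a minimal sketch covering every library call used in this section (pandas, NumPy, Matplotlib and scikit-learn):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics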
Loading the Dataset
• dataset = pd.read_csv('/Users/badamasiimamyau/Desktop/USPollution/Pollution.csv')

• Let us explore the dataset a bit. To see the shape of the dataset, run the following command:

• dataset.shape

• This returns the following:

• (1746661, 29)

• The above output shows that our dataset has 1746661 rows and 29 columns.


Linear Regression with Python

Dataset on a 2-D plot

• We can also plot our dataset on a 2-D plot and see whether there are any correlations in the dataset. Here is the code for that:

• dataset.plot(x='State Code', y='NO2 1st Max Value', style='o')

• plt.title('State Code vs NO2 1st Max Value')

• plt.xlabel('StateCode')

• plt.ylabel('MaxValue')

• plt.show()

• The code returns a scatter plot of NO2 1st Max Value against State Code.

Splitting the Dataset

• We now need to split our dataset into attributes and labels.

• Attributes are the independent variables, while labels are the dependent variables whose values will be predicted.

• We need to predict the Max Value for NO2 based on the State Code. This means that State Code will be our x variable while NO2 1st Max Value will be our y variable.

• X = dataset['State Code'].values.reshape(-1,1)

• y = dataset['NO2 1st Max Value'].values.reshape(-1,1)

• We now need to split our dataset into 80% as the train set and 20% as the test set. Here is the code for this:

• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


Training the Algorithm

• Now that we have split our dataset, it is time to train the algorithm. We create an instance of the LinearRegression class and call the fit() function as shown below:

• regressor = LinearRegression()

• regressor.fit(X_train, y_train) # train the algorithm

• The linear regression model finds the best values for the intercept and slope, which give a line that best fits the data. To see the values of the intercept and slope calculated by the linear regression algorithm for our dataset, run the following code:

• # Retrieve the intercept:

• print(regressor.intercept_)

• # Retrieve the slope:

• print(regressor.coef_)

• This returns the following:

• [26.86742714]

• [[-0.06505895]]

• The output indicates that for every one-unit increase in State Code, the predicted NO2 1st Max Value changes by about -0.065 (a change in concentration units, not a percentage).
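• In other words, using the intercept and slope printed above, the fitted model is approximately:

NO2 1st Max Value ≈ 26.867 − 0.0651 × State Code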


Dataset Description

• We can call the describe() function to give us the statistical details of our dataset:

• dataset.describe()

• This returns summary statistics (count, mean, standard deviation, min, quartiles and max) for each numeric column.


Making Predictions

• The algorithm has been trained, hence we can use it to make predictions.

• We will do this using our test data and see how well it predicts the NO2 1st Max Value.

• The following command will help you make predictions:

• y_pred = regressor.predict(X_test)

• We can now compare the actual output values for X_test with the predicted values by running the following script:

• df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})

• print(df)
Making Predictions Cont.

• To create a bar graph that visualizes the comparison result, the following script can be used:

• df1 = df.head(25)

• df1.plot(kind='bar', figsize=(16,10))

• plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')

• plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')

• plt.show()

• The code returns a bar graph comparing the actual and predicted values for the first 25 test samples.


Making Predictions Cont.

• The bar graph shows a comparison between the actual and the predicted values.

• It shows that the predicted values are very close to the actual ones.
Making Predictions Cont.

• Now, let us create a straight-line plot showing the test data:

• plt.scatter(X_test, y_test, color='gray')

• plt.plot(X_test, y_pred, color='red', linewidth=2)

• plt.show()

• It returns a scatter plot of the test data with the fitted regression line drawn through it.

• From the figure, we get a straight-line graph. What does this tell us about our algorithm?
Evaluating the Algorithm
• To test our algorithm we will use three metrics, namely (their definitions are recalled just after this list):

• The Mean Absolute Error (MAE)

• The Mean Squared Error (MSE) and

• The Root Mean Squared Error (RMSE)
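• For reference, with $y_i$ the actual value, $\hat{y}_i$ the predicted value and $n$ the number of test samples, these three metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$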

• We can print them using the following code:

• print('Mean Absolute Error is:', metrics.mean_absolute_error(y_test, y_pred))

• print('Mean Squared Error is:', metrics.mean_squared_error(y_test, y_pred))

• print('Root Mean Squared Error is:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

• This will return the following:

• Mean Absolute Error is: 13.275983143006018

• Mean Squared Error is: 275.6422481748691

• Root Mean Squared Error is: 16.602477169834298


K-Means Clustering Algorithm
• K-Means is one of the simplest clustering algorithms that we have.

• Clustering falls under the category of unsupervised machine learning algorithms.

• It is often applied when the data is not labelled.

• The goal of the algorithm is to identify clusters or groups within the data.

• The idea behind the clusters is that the objects contained in one cluster are more related to one another than to the objects in the other clusters.

• The similarity is a metric reflecting the strength of the relationship between two data objects.

• Clustering is widely applied in exploratory data mining.

• It has many uses in diverse fields such as pattern recognition, machine learning, information retrieval, image analysis, data compression, bio-informatics and computer graphics.

• The algorithm forms clusters of data based on the similarity between data values.

• You are required to specify the value of K, which is the number of clusters that you expect the algorithm to make from
the data.
K-Means Clustering Algorithm Cont.

• The algorithm first selects a centroid value for every cluster. After that, it performs three steps in an iterative manner (a minimal sketch of these steps follows the list):

• Calculate the Euclidean distance between every data instance and the centroids of all clusters.

• Assign each data instance to the cluster whose centroid is nearest.

• Calculate the new centroid values as the mean of the coordinates of the data instances assigned to the corresponding cluster.
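• To make the three steps concrete, here is a minimal from-scratch sketch in NumPy (illustrative only; the slides themselves use scikit-learn's KMeans, introduced below). The function name kmeans and its parameters are our own choices, not part of any library:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select k random data instances as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: Euclidean distance from every instance to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2: assign each instance to the cluster with the nearest centroid
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned instances
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels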

Scikit-Learn library
Importing Libraries

• Let us begin by importing all the libraries that we will be using in this project:

• The Pandas library will help us read and write to spreadsheets.

• The Numpy library will help us generate random data.

• The Matplotlib library will help us visualize our data when necessary.
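• The import code on the original slide is not shown here; a sketch matching the three libraries just described, using their conventional aliases:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt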
Generating Random Data

• In this example, we will be generating random data and then using it in the code.

• The generation of the data will be done in a two-dimensional space as shown below:

X = -2 * np.random.rand(100,2)

X1 = 1 + 2 * np.random.rand(50,2)

X[50:100, :] = X1

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

plt.show()

• We have generated the data and plotted it in a scatter plot.

• Note that we have generated 100 data points. They have then been divided into two groups, each with 50 data points.
Processing the Data

• The scikit-learn library comes with multiple functions that we can use to process randomly generated data.

• Let us first fit a machine learning model to the data.

• First, we import the KMeans model from the library.

• The parameter n_clusters has been given a value of 2, meaning that the algorithm will create 2 clusters from the data.

• The code will print the K-Means model with its parameters after its execution (see the sketch below):
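• The code image from the original slide is not reproduced; a minimal sketch consistent with the Kmean variable used on the following slides:

from sklearn.cluster import KMeans

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
print(Kmean)  # prints the model and its parameters, e.g. KMeans(n_clusters=2)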
Finding the Cluster Centroids

• Note that our instruction was to create 2 clusters from the dataset.

• Now that they have been created, we need to see their centroids.

• Run the following code:

print(Kmean.cluster_centers_)

• The code will return the following:

[[-0.95636312 -0.89363157]

[ 2.10166026 2.04674506]]

• These are the points on the Cartesian plane for the 2 cluster centroids. Let us display them on a plot using different colors:
Finding the Cluster Centroids Cont.

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')

# centroid coordinates taken from Kmean.cluster_centers_ above
plt.scatter(-0.95636312, -0.89363157, s=200, c='g', marker='s')

plt.scatter(2.10166026, 2.04674506, s=200, c='r', marker='s')

plt.show()

• The code plots the data with the centroids for the two clusters shown in unique colors.

• Each centroid lies almost at the center of its cluster.
Testing

• Now that we have created clusters and seen the centroids, we need to assign labels to the clusters.

• Each cluster will be assigned a different label. Let us print the labels:

• print(Kmean.labels_)

• The first cluster has been assigned a label of 0, while the second cluster has been assigned a label of 1.

• The first cluster has 50 data points, hence we have 50 0’s.

• This is also the case with the second cluster, hence we have 50 1’s.
Making Predictions

• Everything about our clusters is now set, hence we can use them to make predictions.

• If we have a particular data point, it is possible for us to know the cluster to which it belongs.

• This is demonstrated in the following code:

• sample_test = np.array([-2.0, -2.0])

• second_test = sample_test.reshape(1, -1)

• print(Kmean.predict(second_test))

• This returns the following:

• [0]

• The above result shows that the data point [-2.0, -2.0] belongs to cluster 0.

• Let us predict the cluster of another data point. We are checking the cluster for the point [1.0, 1.0]:

• sample_test = np.array([1.0, 1.0])

• second_test = sample_test.reshape(1, -1)

• print(Kmean.predict(second_test))

• This will return the following:

• [1]

• From the above output, we can tell that the data point is assigned to the cluster with a label of 1.
End of the Slides
Thank You
