CS 611 Slides 4
• This dataset deals with pollution in the U.S., which has been well documented by the U.S. EPA. It includes four major pollutants: Nitrogen Dioxide (NO2), Sulphur Dioxide (SO2), Carbon Monoxide (CO) and Ozone (O3).
• Each of the four pollutants (NO2, O3, SO2 and CO) has 5 specific columns. For instance, for NO2:
• NO2 Mean : The arithmetic mean of the NO2 concentration within a given day
• NO2 AQI : The calculated air quality index of NO2 within a given day
• NO2 1st Max Value : The maximum value obtained for NO2 concentration in a given day
• NO2 1st Max Hour : The hour when the maximum NO2 concentration was recorded in a given day
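As a hypothetical illustration of what the Mean, 1st Max Value and 1st Max Hour columns represent, they can be derived from one day of hourly readings (the values below are made up, not taken from the EPA dataset):

```python
import numpy as np

# Hypothetical hourly NO2 readings for one day (24 values) -- illustrative only
hourly_no2 = np.array([12, 10, 9, 8, 8, 11, 18, 25, 30, 28,
                       24, 22, 21, 20, 19, 21, 26, 33, 35, 31,
                       27, 22, 17, 14], dtype=float)

no2_mean = hourly_no2.mean()             # corresponds to "NO2 Mean"
no2_max = hourly_no2.max()               # corresponds to "NO2 1st Max Value"
no2_max_hour = int(hourly_no2.argmax())  # corresponds to "NO2 1st Max Hour" (0-23)

print(no2_mean, no2_max, no2_max_hour)
```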
Linear Regression with Python
• dataset.shape
• (1746661, 29)
Linear Regression with Python
• plt.xlabel('StateCode')
• plt.ylabel('MaxValue')
• plt.show()
• Attributes are the independent variables while labels are the dependent variables whose values
will be predicted.
• We need to predict the Max Value for NO2 based on the State Code. This means that State
Code will be our x variable while NO2 1st Max Value will be our y variable.
• X = dataset['State Code'].values.reshape(-1,1)
• y = dataset['NO2 1st Max Value'].values.reshape(-1,1)
• We now need to split our dataset into a train set (80%) and a test set (20%). Here is the code for this:
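A minimal sketch of the 80/20 split using scikit-learn's train_test_split; the synthetic X and y below stand in for the dataset's State Code and NO2 1st Max Value columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for dataset['State Code'] and dataset['NO2 1st Max Value']
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 27 - 0.065 * X + np.random.randn(100, 1)

# 80% of rows go to the train set, 20% to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (80, 1) (20, 1)
```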
• regressor = LinearRegression()
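Before the intercept and slope can be inspected, the regressor presumably has to be fit on the training data; a minimal sketch with synthetic stand-in data (the real X_train and y_train come from the split above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data standing in for X_train / y_train:
# an exact line y = 27 - 0.065 * x, so the fit recovers it perfectly
X_train = np.arange(50, dtype=float).reshape(-1, 1)
y_train = 27 - 0.065 * X_train

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # learns intercept_ and coef_
print(regressor.intercept_, regressor.coef_)
```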
• The linear regression model finds the best value for the intercept and slope, which gives a line that best fits the data. To see the value of the
intercept and slope calculated by the linear regression algorithm for our dataset, run the following code.
• print(regressor.intercept_)
• print(regressor.coef_)
• [26.86742714]
• [[-0.06505895]]
• The output indicates that for every one unit increase in State Code, the predicted NO2 1st Max Value changes by about -0.065.
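As a quick check of what the fitted line implies, a point prediction can be computed by hand from the printed intercept and slope (the state code 10 below is just an illustrative example):

```python
# Values printed by regressor.intercept_ and regressor.coef_ above
intercept = 26.86742714
slope = -0.06505895

state_code = 10  # hypothetical example input
pred = intercept + slope * state_code
print(round(pred, 4))  # 26.2168
```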
Dataset Description
• dataset.describe()
• We will do this using our test data and see how well it predicts the NO2 1st Max Value.
• y_pred = regressor.predict(X_test)
• We can now compare the actual output values for X_test with the predicted
values, by running the following script:
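The df plotted below is presumably built from y_test and y_pred; a minimal sketch with stand-in arrays:

```python
import numpy as np
import pandas as pd

# Stand-ins for the model's test labels and predictions
y_test = np.array([[30.0], [25.0], [28.0]])
y_pred = np.array([[29.1], [25.8], [27.4]])

# Actual vs. predicted values side by side
df = pd.DataFrame({'Actual': y_test.flatten(),
                   'Predicted': y_pred.flatten()})
print(df)
```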
• df1 = df.head(25)
• df1.plot(kind='bar', figsize=(16,10))
• plt.show()
• plt.scatter(X_test, y_test, color='gray')
• plt.plot(X_test, y_pred, color='red', linewidth=2)
• plt.show()
• print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
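The root mean squared error above is one of the standard regression metrics; a sketch of MAE, MSE and RMSE together on stand-in arrays (the values are illustrative, not from the pollution dataset):

```python
import numpy as np
from sklearn import metrics

# Stand-in arrays for y_test / y_pred; every prediction is off by exactly 1
y_test = np.array([30.0, 25.0, 28.0, 22.0])
y_pred = np.array([29.0, 26.0, 27.0, 23.0])

mae = metrics.mean_absolute_error(y_test, y_pred)   # mean |error|
mse = metrics.mean_squared_error(y_test, y_pred)    # mean error^2
rmse = np.sqrt(mse)                                 # root of the MSE

print(mae, mse, rmse)  # 1.0 1.0 1.0
```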
K-Means Clustering Algorithm
• The goal of the algorithm is to identify clusters or groups within the data.
• The idea behind the clusters is that the objects in one cluster are more related to one another than to the objects in other clusters.
• Similarity is a metric reflecting the strength of the relationship between two data objects.
• It has many uses in diverse fields such as pattern recognition, machine learning, information retrieval, image analysis, data compression, bio-informatics and computer graphics.
• The algorithm forms clusters of data based on the similarity between data values.
• You are required to specify the value of K, which is the number of clusters that you expect the algorithm to make from
the data.
K-Means Clustering Algorithm Cont.
• The algorithm first selects a centroid value for every cluster. After that, it performs three steps in an iterative manner:
• Calculate the Euclidean distance between every data instance and the centroids of all clusters.
• Assign each data instance to the cluster whose centroid is nearest to it.
• Recalculate each centroid as the mean of the coordinates of the data instances in the corresponding cluster.
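The three iterative steps above can be sketched in plain NumPy (toy data and a naive first-K-points initialisation, assumed here purely for illustration; it also assumes no cluster becomes empty):

```python
import numpy as np

# Toy 2-D data: two obvious groups (illustrative values)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
K = 2
centroids = X[:K].copy()  # naive initialisation: first K points

for _ in range(10):
    # Step 1: Euclidean distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Step 2: assign each point to its nearest centroid
    labels = dists.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of its cluster's points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centroids)  # converges to the means of the two groups
```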
Scikit-Learn library
Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
• We have generated the data and plotted it in
a scatter plot.
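Before the centroids can be printed, a KMeans model presumably has to be created and fit on X; a sketch that regenerates the two-blob data (the random seed is an assumption added here for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)  # assumed seed, not in the original slides
X = -2 * np.random.rand(100, 2)   # first 50 rows: blob around (-1, -1)
X1 = 1 + 2 * np.random.rand(50, 2)  # blob around (2, 2)
X[50:100, :] = X1

Kmean = KMeans(n_clusters=2)  # K = 2 clusters expected
Kmean.fit(X)
print(Kmean.cluster_centers_)
```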
• Now that the clusters have been created, we need to find their centroids.
print(Kmean.cluster_centers_)
[[-0.95636312 -0.89363157]
[ 2.10166026 2.04674506]]
• These are the points on the Cartesian plane for the 2 cluster centroids. Let us display them on a plot using different colors:
Finding the Cluster Centroids Cont.
plt.scatter(-0.95636312, -0.89363157, s=200, c='g', marker='s')
plt.scatter(2.10166026, 2.04674506, s=200, c='r', marker='s')
plt.show()
• print(Kmean.labels_)
• sample_test = np.array([1.0, 1.0])
• first_test = sample_test.reshape(1, -1)
• print(Kmean.predict(first_test))
• sample_test = np.array([-2.0, -2.0])
• second_test = sample_test.reshape(1, -1)
• print(Kmean.predict(second_test))
• [1]
• [0]