Data Science Machine Learning
MAHAMAYA POLYTECHNIC OF INFORMATION
TECHNOLOGY (AMROHA).
Submitted to
Dr Jaya Singh
For
DIPLOMA OF TECHNOLOGY
In
Practical-01
Q-1 Write a program in Python to implement the Decision Tree Algorithm
The decision tree is one of the most powerful and popular algorithms. The decision-tree algorithm
falls under the category of supervised learning algorithms. It works for both continuous as
well as categorical output variables.
There are two types of decision trees. They are categorized based on the type of the target
variable they have. If the decision tree has a categorical target variable, then it is called a
‘categorical variable decision tree’. Similarly, if it has a continuous target variable, it is called
a ‘continuous variable decision tree’.
# Python program to implement decision tree algorithm and plot the tree

# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

# Loading the dataset
iris = load_iris()

# Converting the data to a pandas dataframe
data = pd.DataFrame(data = iris.data, columns = iris.feature_names)

# Creating a separate column for the target variable of the iris dataset
data['Species'] = iris.target

# Replacing the categories of the target variable with the actual names of the species
target = np.unique(iris.target)
target_n = np.unique(iris.target_names)
target_dict = dict(zip(target, target_n))
data['Species'] = data['Species'].replace(target_dict)

# Separating the independent and dependent variables of the dataset
x = data.drop(columns = "Species")
y = data["Species"]
names_features = x.columns
target_labels = y.unique()

# Splitting the dataset into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 93)

# Importing the Decision Tree classifier class from sklearn
from sklearn.tree import DecisionTreeClassifier

# Creating an instance of the classifier class
dtc = DecisionTreeClassifier(max_depth = 3, random_state = 93)

# Fitting the training dataset to the model
dtc.fit(x_train, y_train)

# Plotting the decision tree
plt.figure(figsize = (30, 10), facecolor = 'b')
Tree = tree.plot_tree(dtc, feature_names = names_features, class_names = target_labels, rounded = True, filled = True, fontsize = 14)
plt.show()

# Predicting on the test set
y_pred = dtc.predict(x_test)

# Finding the confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
matrix = pd.DataFrame(confusion_matrix)

# Plotting the heatmap (the figure must be created before its axes)
sns.set(font_scale = 1.3)
plt.figure(figsize = (10, 7))
axis = plt.axes()
sns.heatmap(matrix, annot = True, fmt = "g", ax = axis, cmap = "magma")
axis.set_title('Confusion Matrix')
axis.set_xlabel("Predicted Values", fontsize = 10)
axis.set_xticklabels(list(target_labels))
axis.set_ylabel("True Labels", fontsize = 10)
axis.set_yticklabels(list(target_labels), rotation = 0)
plt.show()
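The program above builds a categorical-variable decision tree (a classifier). For the continuous-variable case mentioned at the start of this practical, a minimal sketch using sklearn's DecisionTreeRegressor could look like the following; the noisy quadratic toy data and the parameter choices are assumptions made only for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data, assumed for illustration: a noisy quadratic relationship
rng = np.random.RandomState(93)
x = np.sort(5 * rng.rand(80, 1), axis = 0)
y = x.ravel() ** 2 + rng.normal(0, 0.5, 80)

# A continuous-variable decision tree predicts a numeric value
# by averaging the training targets that fall in each leaf
dtr = DecisionTreeRegressor(max_depth = 3, random_state = 93)
dtr.fit(x, y)
print(dtr.predict([[2.5]]))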
Practical-02
Q-1 Write a program in Python to implement the K-means Algorithm
K-means is an unsupervised learning method for clustering data points. The algorithm
iteratively divides data points into K clusters by minimizing the variance in each cluster.
We will show you how to estimate the best value for K using the elbow method, then use K-
means clustering to group the data points into clusters.
Work
First, each data point is randomly assigned to one of the K clusters. Then, we compute the
centroid (functionally the center) of each cluster, and reassign each data point to the cluster
with the closest centroid. We repeat this process until the cluster assignments for each data
point are no longer changing.
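To make these two steps concrete, here is a minimal sketch of the assignment and update loop in plain NumPy. It is illustrative only: it initializes the centroids from K randomly chosen points (a common variant of the random assignment described above), uses Euclidean distance, and assumes no cluster ever becomes empty.

import numpy as np

def kmeans(points, k, n_iter = 100, seed = 93):
    # Initializing the centroids from K randomly chosen data points
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace = False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis = 2)
        labels = dists.argmin(axis = 1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis = 0) for j in range(k)])
        # Stopping when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids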
K-means clustering requires us to select K, the number of clusters we want to group the data
into. The elbow method lets us graph the inertia (a distance-based metric) and visualize the
point at which it starts decreasing linearly. This point is referred to as the "elbow" and is a
good estimate for the best value for K based on our data.
Program
import matplotlib.pyplot as plt

# Sample data points (assumed for illustration)
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

plt.scatter(x, y)
plt.show()
Output
(Scatter plot of the data points.)
import numpy as np
import matplotlib.pyplot as plt
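from sklearn.cluster import KMeans

# A sketch of the elbow method described above, continuing from the imports;
# the sample data points are assumed for illustration
x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
data = list(zip(x, y))

# Computing the inertia for K = 1 to 10
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, n_init = 10)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

# Plotting inertia against K; the bend ("elbow") marks a good value for K
plt.plot(range(1, 11), inertias, marker = 'o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

# Suppose the elbow appears at K = 2: refit and color the points by cluster
kmeans = KMeans(n_clusters = 2, n_init = 10)
kmeans.fit(data)
plt.scatter(x, y, c = kmeans.labels_)
plt.show()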
Practical-03
Q-1 Write a program in Python to implement Linear Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning, and in statistical modeling, that relationship is used to predict the
outcome of future events.
Linear Regression
Linear regression uses the relationship between the data points to draw a straight line through
all of them.
Work
Python has methods for finding a relationship between data points and for drawing a line of
linear regression. We will show you how to use these methods instead of going through the
mathematical formula.
In the example below, the x-axis represents age, and the y-axis represents speed. We have
registered the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data
we collected could be used in a linear regression:
Code
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Output
(Scatter plot of the age/speed data points.)
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Finding the slope and intercept of the line of best fit
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

# Running each value of x through the function to get the fitted line
mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
Output
(Scatter plot with the fitted regression line drawn through the data points.)
import numpy as np
from sklearn.linear_model import LinearRegression

# Reusing the age/speed data above; sklearn expects a 2-D feature matrix
features = np.array(x).reshape(-1, 1)
model = LinearRegression().fit(features, y)

# Predict the values of the target variable for the given features
predictions = model.predict(features)

# The fitted slope and intercept should match those from stats.linregress
print(model.coef_[0], model.intercept_)
Practical-04
Q-1 Write a program in Python to implement the K-NN Algorithm
K-NN is a supervised learning algorithm. Supervised learning is learning where the value or
result that we want to predict is within the training data (labeled data), and the value to be
predicted is known as the Target, Dependent Variable, or Response Variable.
All the other columns in the dataset are known as Features, Predictor Variables, or
Independent Variables.
Supervised Learning is classified into two categories:
1. Classification: Here our target variable consists of categories.
2. Regression: Here our target variable is continuous, and we usually try to find out the line
or curve that best fits the data.
k-nearest neighbor algorithm:
This algorithm is used to solve classification problems. The K-nearest neighbor or K-NN
algorithm essentially creates an imaginary boundary to classify the data: when new data
points come in, the algorithm predicts their class from the nearest side of that boundary.
Therefore, a larger k value means smoother curves of separation, resulting in less complex
models, whereas a smaller k value tends to overfit the data, resulting in more complex models.
Note: It’s very important to have the right k-value when analyzing the dataset to avoid
overfitting and underfitting of the dataset.
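One common way to find a reasonable k is to score several candidate values and compare. The sketch below assumes the iris dataset and 5-fold cross-validation purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Scoring odd values of k with 5-fold cross-validation;
# the k with the highest mean accuracy is a sensible choice
for k in range(1, 12, 2):
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(knn, iris.data, iris.target, cv = 5)
    print(k, scores.mean())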
Using the k-nearest neighbor algorithm we fit the historical data (or train the model) and
predict the future.
1. The k-nearest neighbor algorithm is imported from the scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Generate a k-NN model using neighbor’s value.
5. Train or fit the data into the model.
6. Predict the future.
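A minimal sketch following these six steps, assuming the iris dataset and an illustrative neighbors value of 7, might look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Steps 1-2: import the algorithm and create feature and target variables
iris = load_iris()
X, y = iris.data, iris.target

# Step 3: split data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 93)

# Steps 4-5: generate a k-NN model and fit the training data
knn = KNeighborsClassifier(n_neighbors = 7)
knn.fit(X_train, y_train)

# Step 6: predict on the unseen test data and check the accuracy
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))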
K is the number of nearest neighbors to use. For classification, a majority vote is used to
determine which class a new observation should fall into. Larger values of K are often more
robust to outliers and produce more stable decision boundaries than very small values
(K = 3 would be better than K = 1, which might produce undesirable results).
Code

from sklearn.neighbors import KNeighborsClassifier

# Sample data points and their class labels (assumed for illustration)
x = [4, 5, 10, 4, 3, 11, 14, 8, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
classes = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Turning the two feature lists into (x, y) points and fitting the model
data = list(zip(x, y))
knn = KNeighborsClassifier(n_neighbors = 1)
knn.fit(data, classes)

# Classifying a new, unseen point
new_x = 8
new_y = 21
new_point = [(new_x, new_y)]
prediction = knn.predict(new_point)
print(prediction)
Output
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Choose the value of K
k = 5

# Refitting the model with more neighbors (reusing data, classes and new_point from above)
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(data, classes)
prediction = knn.predict(new_point)

# Plotting all points, coloring the new point with its predicted class
plt.scatter(x + [new_x], y + [new_y], c = classes + [prediction[0]])
plt.text(x = new_x - 1.7, y = new_y - 0.7, s = f"new point, class: {prediction[0]}")
plt.show()