ML Lab File
Submitted By:
Shivam Gupta
2K16/CO/295
Computer Engineering
Vth semester
A4 Batch
INDEX
S. No. Name of Practical Date Sign
AIM 1
To understand basic syntax of Python programming language.
THEORY
Python is a general-purpose interpreted, interactive, object-oriented, and high-level programming
language. It was created by Guido van Rossum during 1985- 1990.
Python is a high-level, interpreted, interactive and object-oriented scripting language. Python is
designed to be highly readable. It uses English keywords frequently, whereas other languages use
punctuation, and it has fewer syntactical constructions than other languages.
Python is Interpreted − Python is processed at runtime by the interpreter. You do not need
to compile your program before executing it. This is similar to PERL and PHP.
Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python is Object-Oriented − Python supports Object-Oriented style or technique of
programming that encapsulates code within objects.
Python is a Beginner's Language − Python is a great language for the beginner-level
programmers and supports the development of a wide range of applications from simple
text processing to WWW browsers to games.
SYNTAX
Printing a string
print("Hello World!")
Output: Hello World!
Python Indentations
Python provides no braces to indicate blocks of code for class and function definitions or
flow control. Blocks of code are denoted by line indentation, which is rigidly enforced.
Example:
if 1 < 2:
    print("One is less than two!")
Output: One is less than two!
Thus, in Python, all contiguous lines indented with the same number of spaces form a block.
Multi-Line Statements
Statements in Python typically end with a new line. Python does, however, allow the use of
the line continuation character (\) to denote that the line should continue. For example –
total = item_one + \
        item_two + \
        item_three
Statements contained within the [], {}, or () brackets do not need to use the line
continuation character. For example −
days = ['Monday', 'Tuesday', 'Wednesday',
        'Thursday', 'Friday']
Comments in Python
A hash sign (#) that is not inside a string literal begins a comment. All characters after the #
and up to the end of the physical line are part of the comment and the Python interpreter
ignores them.
# First comment
print "Hello, Python!" # second comment
Docstrings
Python also has an extended documentation capability, called docstrings. Docstrings can be one
line or multiline. Python uses triple quotes at the beginning and end of the docstring:
Example:
"""This is a
multiline docstring."""
Creating Variables
Unlike many other programming languages, Python has no command for declaring a variable. A
variable is created the moment you first assign a value to it. For example –
x=5
y = "John"
List
A list is a collection which is ordered and changeable. In Python lists are written with square
brackets.
thislist = ["apple", "banana", "cherry"]
Tuple
A tuple is a collection which is ordered and unchangeable. In Python tuples are written with
round brackets.
thistuple = ("apple", "banana", "cherry")
Set
A set is a collection which is unordered and unindexed. In Python sets are written with curly
brackets. Example -
thisset = {"apple", "banana", "cherry"}
Dictionary
A dictionary is a collection which is unordered, changeable and indexed. In Python
dictionaries are written with curly brackets, and they have keys and values. Example
thisdict = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
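For illustration, a short snippet (with hypothetical values) showing how elements of these collections are accessed and modified:
thislist = ["apple", "banana", "cherry"]
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}
print(thislist[0])        # "apple": lists are indexed from 0
thislist[1] = "mango"     # lists are changeable
print(thisdict["model"])  # "Mustang": dictionaries are accessed by key
thisdict["year"] = 2018   # values can be updated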
DISCUSSION
Python is an interpreted high-level programming language for general-purpose programming.
Created by Guido van Rossum and first released in 1991, Python has a design philosophy that
emphasizes code readability, notably using significant whitespace. It provides constructs that
enable clear programming on both small and large scales.
THEORY
NumPy, which stands for Numerical Python, is a library consisting of multidimensional array objects
and a collection of routines for processing those arrays. Using NumPy, mathematical and logical
operations on arrays can be performed efficiently. NumPy adds support for large, multi-dimensional
arrays and matrices, along with a large collection of high-level mathematical functions to operate
on these arrays.
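As a minimal illustration of the array operations described above (not part of the lab code; the values are arbitrary):
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2-D ndarray
print(a.shape)                        # (2, 3)
print(a * 2)                          # element-wise multiplication
print(a.sum(axis=0))                  # column-wise sums: [5 7 9]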
CODE
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
data = np.genfromtxt("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv",delimiter='\t')
data = pd.DataFrame(data)
#data = pd.read_csv("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv",sep='\t',header=0)
print("First 10 rows of data are:")
data[:10]
print("Number of dimensions are: ",data.ndim)
print("Shape of data is: ",data.shape)
print("Rows which have any of the dimension as NaN:")
data[np.isnan(data[0]) | np.isnan(data[1])]
x = data[0]
y = data[1]
plt.scatter(x,y,s=12)
plt.title("Web Traffic over last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(6)],['week %i' %w for w in range(6)])
plt.grid()
plt.show()
OUTPUT
DISCUSSION
The core functionality of NumPy is its "ndarray", for n-dimensional array, data structure. These
arrays are strided views on memory. In contrast to Python's built-in list data structure (which,
despite the name, is a dynamic array), these arrays are homogeneously typed: all elements of a
single array must be of the same type. The ndarray object consists of a contiguous one-dimensional
segment of computer memory, combined with an indexing scheme that maps each item to a
location in the memory block.
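A small illustrative snippet (arbitrary values) of the homogeneous typing and strided layout described above:
import numpy as np
a = np.array([1, 2, 3])
print(a.dtype)             # a single type (e.g. int64) shared by every element
b = np.array([1, 2.5, 3])  # mixing ints and floats upcasts the whole array
print(b.dtype)             # float64
print(b.strides)           # byte step between consecutive elements in memory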
Before Pandas, Python was mainly used for data munging and preparation; it contributed little to
data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps
in the processing and analysis of data, regardless of the origin of the data: load, prepare,
manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics, and analytics.
THEORY
In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to
uniquely determine the position of the points or other geometric elements on a manifold such as
Euclidean space. All spatial data is created in some coordinate system. A projection is the means by
which you display the coordinate system and your data on a flat surface, such as a piece of paper or
a digital screen. After defining the coordinate system that matches your data, you may still find you
want to use data in a different coordinate system. This is where transformations are useful.
Transformations are required to convert data between different geographic coordinate systems or
between different vertical coordinate systems.
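To make the transformation idea concrete, a 2-D rotation by an angle theta is the linear map with matrix [[cos θ, −sin θ], [sin θ, cos θ]]; a minimal sketch with an illustrative point:
import numpy as np
theta = np.radians(45)                      # rotation angle in radians
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
p = np.array([1.0, 0.0])                    # a point on the x-axis
print(R @ p)                                # rotated 45 degrees counterclockwise: [0.707..., 0.707...]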
CODE
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
data = np.genfromtxt("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv",delimiter='\t')
#data = pd.read_csv("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv",sep='\t',header=0)
data = pd.DataFrame(data)
data = data[~(np.isnan(data[0]) | np.isnan(data[1]))]
x = data[0]
y = data[1]
plt.scatter(x,y,s=12)
plt.title("Web Traffic over last month (Original Data)")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(6)],['week %i' %w for w in range(6)])
#plt.autoscale(tight=True)
plt.grid()
plt.show()
plt.scatter(0*x,y,c = 'red', s = 8)
plt.title("Data projected on Y-axis")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid()
plt.show()
theta = np.radians(45)   # np.cos and np.sin expect radians
rot_matrix = np.array([[np.cos(theta),-np.sin(theta)],[np.sin(theta),np.cos(theta)]])
data1 = data.dot(rot_matrix.T)   # rows are points, so multiply by R transpose to rotate them counterclockwise
x = data1[0]
y = data1[1]
plt.scatter(x,y,s=10, c = 'blue')
plt.title("Data represented in the xy-plane rotated counterclockwise through an angle of 45
degrees")
plt.grid()
plt.show()
DISCUSSION
We learned how to project a set of data points onto new axes. In the video-game industry, matrices
are the main mathematical tool for constructing and manipulating realistic animations of polygonal
figures. Examples of matrix operations include translations, rotations, and scaling; other matrix-
transformation concepts include field of view, rendering, color transformation, and projection.
AIM 4
To implement Principal Component Analysis (PCA).
THEORY
Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a
large set of variables to a small set that still contains most of the information in the large set.
Principal component analysis (PCA) is a mathematical procedure that transforms a number of
(possibly) correlated variables into a (smaller) number of uncorrelated variables called principal
components.
The first principal component accounts for as much of the variability in the data as possible, and
each succeeding component accounts for as much of the remaining variability as possible.
ALGORITHM
Step 1: Normalize the data
Step 2: Calculate the covariance matrix
Step 3: Calculate the eigenvalues and eigenvectors
Step 4: Rearrange the eigenvectors and eigenvalues in decreasing order of eigenvalues
Step 5: Compute the cumulative energy content for each eigenvector
Step 6: Based on the cumulative energy content select a subset of the eigenvectors as basis vectors
Step 7: Project the data onto the new basis.
CODE
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data
Y_pca = iris.target
X -= np.mean(X,axis = 0)
iris.target_names
np.cov(X.T)
#(X - np.mean(X,axis = 0)).T.dot(X - np.mean(X,axis = 0))/(X.shape[0]-1)
w,v = np.linalg.eig(np.cov(X.T))
# eigenvectors are the columns of v, so sort the columns by decreasing eigenvalue
idx = np.argsort(w)[::-1]
w = w[idx]
v = v[:,idx]
projected_matrix = X.dot(v[:,:2])
plt.scatter(projected_matrix[:,0][Y_pca == 0],projected_matrix[:,1][Y_pca == 0],
c = 'C0', label = iris.target_names[0])
plt.scatter(projected_matrix[:,0][Y_pca == 1],projected_matrix[:,1][Y_pca == 1],
c = 'C1', label = iris.target_names[1])
plt.scatter(projected_matrix[:,0][Y_pca == 2],projected_matrix[:,1][Y_pca == 2],
c = 'C2', label = iris.target_names[2])
plt.title('Iris Dataset Scatter Plot in 2D')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()
projected_matrix = X.dot(v[:,:1])
plt.scatter(projected_matrix[:,0][Y_pca == 0],Y_pca[Y_pca == 0],
c = 'C0', label = iris.target_names[0])
plt.scatter(projected_matrix[:,0][Y_pca == 1],Y_pca[Y_pca == 1],
c = 'C1', label = iris.target_names[1])
plt.scatter(projected_matrix[:,0][Y_pca == 2],Y_pca[Y_pca == 2],
c = 'C2', label = iris.target_names[2])
plt.title('Iris Dataset Scatter Plot in 1D')
plt.xlabel('First principal component')
plt.ylabel('Target class')
plt.legend()
plt.show()
OUTPUT
DISCUSSION
PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is
often used to visualize genetic distance and relatedness between populations.
PCA reduces attribute space from a larger number of variables to a smaller number of factors and
as such is a "non-dependent" procedure (that is, it does not assume a dependent variable is
specified).
PCA is a dimensionality reduction or data compression method. The goal is dimension reduction
and there is no guarantee that the dimensions are interpretable (a fact often not appreciated by
(amateur) statisticians).
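As a sanity check (not part of the lab submission), the same 2-D projection can be obtained with scikit-learn's PCA; component signs may differ, which is expected:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
iris = load_iris()
pca = PCA(n_components=2)
projected = pca.fit_transform(iris.data)  # centers the data and projects onto the top-2 components
print(pca.explained_variance_ratio_)      # fraction of variance captured by each component
print(projected[:5])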
THEORY
A decision tree builds classification or regression models in the form of a tree structure. It breaks
down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree
is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision
node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast, and Rainy). A leaf node (e.g.,
Play) represents a classification or decision.
Entropy is a measure of the amount of uncertainty in the dataset. For a binary class with values a/b
it is defined as:
Entropy = - p(a)*log2(p(a)) - p(b)*log2(p(b))
Information gain IG(A) is the measure of the difference in entropy from before to after the dataset
S is split on an attribute A. In other words, how much uncertainty in S was reduced after splitting
set S on attribute A.
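For a quick illustrative computation (toy counts: 9 positive / 5 negative examples split into subsets of (3, 4) and (6, 1)):
import numpy as np
def entropy(pos, neg):
    # entropy of a binary node with pos positive and neg negative examples
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:                       # 0*log(0) is treated as 0
            p = c / total
            e -= p * np.log2(p)
    return e
H_S = entropy(9, 5)
IG = H_S - (7/14) * entropy(3, 4) - (7/14) * entropy(6, 1)
print(round(H_S, 3), round(IG, 3))  # approximately 0.94 and 0.151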
ALGORITHM
ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If the number of predicting attributes is empty, then Return the single-node tree Root,
        with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A.
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
                Else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root
CODE
import numpy as np
import pandas as pd
input_file = "C:/Users/Shivam gupta/DataScience-Python3/play ball.csv"
df = pd.read_csv(input_file, header = 0)
df.head(20)
features = list(df.columns[:4])
features
y = df["Play ball"]
X = df[features]
y
X
DepDict = dict()
for d in y:
    DepDict[d] = 0
ans = []
def decisionTree(data, depVar, indepVar, Y):
    if len(indepVar) == 0 or len(data) == 0:
        return 1
    if len(Y.value_counts().to_dict()) == 1:
        return 1
    root = ""
    rootEnt = -np.log2(len(DepDict)) - 10
    for x in indepVar:
        values = dict()
        for d in data[x]:
            if d not in values:
                values[d] = 1
            else:
                values[d] += 1
        entropyF = 0.0
        for v in values:
            entropy = 0.0
            for d in DepDict:
                DepDict[d] = 0
            cnt = 0
            for i in range(0,len(data)):
                if data.iloc[i][x] == v:
                    DepDict[data.iloc[i][depVar]] += 1
                    cnt += 1
            for d in DepDict:
                if DepDict[d] != 0 and cnt != 0:
                    entropy -= float(DepDict[d]/cnt) * np.log2(float(DepDict[d]/cnt))
            # weighted entropy of splitting on attribute x
            entropyF += float(cnt/len(data)) * entropy
        # choose the attribute with the highest information gain (lowest weighted entropy)
        if -entropyF > rootEnt:
            rootEnt = -entropyF
            root = x
    values = set()
    for d in data[root]:
        values.add(d)
    for v in values:
        newIndepVar = indepVar.copy()
        if root in newIndepVar:
            newIndepVar.remove(root)
        ans.append(root + " = " + v)
        flag = decisionTree(data[data[root] == v], depVar, newIndepVar,
                            data[data[root] == v][depVar])
        if flag == 1:
            print('\nIf',ans,'\nThen',depVar,' : ',
                  data[data[root] == v][depVar].value_counts(ascending = False).index[0],'\n\n')
        ans.pop(len(ans)-1)
    return 0
decisionTree(df,'Play ball',features,df['Play ball'])
OUTPUT
DISCUSSION
A decision tree is a map of the possible outcomes of a series of related choices. It allows an
individual or organization to weigh possible actions against one another based on their costs,
probabilities, and benefits. They can be used either to drive informal discussion or to map out an
algorithm that predicts the best choice mathematically.
Using decision trees in machine learning has several advantages:
The cost of using the tree to predict data decreases with each additional data point
Works for either categorical or numerical data
Can model problems with multiple outputs
But they also have a few disadvantages:
When dealing with categorical data with multiple levels, the information gain is biased in
favor of the attributes with the most levels.
Calculations can become complex when dealing with uncertainty and lots of linked
outcomes.
Conjunctions between nodes are limited to AND, whereas decision graphs allow for nodes
linked by OR.
FINDINGS AND LEARNINGS
Advantage of ID3:
Understandable prediction rules are created from the training data.
Builds the fastest tree.
Builds a short tree.
Only need to test enough attributes until all data is classified.
Disadvantage of ID3:
Data may be over-fitted or over-classified, if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Difficulty in classifying continuous attributes.
EXPERIMENT NUMBER 6
AIM
To implement Decision Tree CART (Classification And Regression Trees) Algorithm.
THEORY
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an
item (represented in the branches) to conclusions about the item's target value (represented in the
leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine
learning. Tree models where the target variable can take a discrete set of values are called
classification trees; in these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. Decision trees where the target variable can
take continuous values (typically real numbers) are called regression trees.
Gini impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
The Gini impurity can be computed by summing the probability p(i) of an item with label i being
chosen times the probability 1 − p(i) of a mistake in categorizing that item, i.e. Gini = 1 − Σᵢ p(i)².
It reaches its minimum (zero) when all cases in the node fall into a single target category.
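A quick illustrative check of the Gini formula (toy class counts, not from the lab dataset):
def gini(counts):
    # Gini impurity 1 - sum(p_i^2) for a list of class counts
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)
print(gini([5, 5]))   # 0.5 -> maximally impure two-class node
print(gini([10, 0]))  # 0.0 -> pure node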
CODE
import numpy as np
import pandas as pd
input_file = "C:/Users/Shivam gupta/DataScience-Python3/decisionTreeDataset.csv"
df = pd.read_csv(input_file, header = 0)
df.head(20)
features = list(df.columns[:3])
features
y = df["Decision"]
X = df[features]
y
X
DepDict = dict()
for d in y:
    DepDict[d] = 0
ans = []
def decisionTree(data, depVar, indepVar, Y):
    if len(indepVar) == 0 or len(data) == 0:
        return 1
    if len(Y.value_counts().to_dict()) == 1:
        return 1
    root = ""
    rootGini = 2
    for x in indepVar:
        values = dict()
        for d in data[x]:
            if d not in values:
                values[d] = 1
            else:
                values[d] += 1
        avgGiniF = 0.0
        for v in values:
            gini = 1.0
            for d in DepDict:
                DepDict[d] = 0
            cnt = 0
            for i in range(0,len(data)):
                if data.iloc[i][x] == v:
                    DepDict[data.iloc[i][depVar]] += 1
                    cnt += 1
            for d in DepDict:
                if DepDict[d] != 0 and cnt != 0:
                    gini -= float(DepDict[d]/cnt) * float(DepDict[d]/cnt)
            # weighted Gini impurity of splitting on attribute x
            avgGiniF += float(cnt/len(data)) * gini
        # choose the attribute with the lowest weighted Gini impurity
        if avgGiniF < rootGini:
            rootGini = avgGiniF
            root = x
    values = set()
    for d in data[root]:
        values.add(d)
    for v in values:
        newIndepVar = indepVar.copy()
        if root in newIndepVar:
            newIndepVar.remove(root)
        ans.append(root + " = " + v)
        flag = decisionTree(data[data[root] == v], depVar, newIndepVar,
                            data[data[root] == v][depVar])
        if flag == 1:
            print('\nIf',ans,'\nThen',depVar,' : ',
                  data[data[root] == v][depVar].value_counts(ascending = False).index[0],'\n\n')
        ans.pop(len(ans)-1)
    return 0
decisionTree(df,'Decision',features,df['Decision'])
OUTPUT
data.head(10)
np.isnan(data).sum()
data = data[(~np.isnan(data['x'])) & (~np.isnan(data['y']))]
plt.scatter(data['x'],data['y'],s=5,c='g')
# the original slope/intercept computation is not shown in this extract; np.polyfit stands in for the least-squares fit
slope, y_intercept = np.polyfit(data['x'], data['y'], 1)
plt.scatter(data['x'],data['y'],s=8,c='c')
plt.plot(data['x'], slope*data['x']+y_intercept, c='m')
OUTPUT
DISCUSSIONS
The most common method for fitting a regression line is the method of least-squares. This method
calculates the best-fitting line for the observed data by minimizing the sum of the squares of the
vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its
vertical deviation is 0). Because the deviations are first squared, then summed, there are no
cancellations between positive and negative values.
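The fit can also be written out explicitly; a minimal sketch of the closed-form least-squares estimates, assuming equal-length arrays x and y (toy values below):
import numpy as np
def least_squares(x, y):
    # slope and intercept minimizing the sum of squared vertical deviations
    x_mean, y_mean = x.mean(), y.mean()
    slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    intercept = y_mean - slope * x_mean
    return slope, intercept
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(least_squares(x, y))  # close to (2.0, 0.0) for this nearly-linear toy data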
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

(X,y) = make_blobs(n_samples=50,n_features=2,centers=2,cluster_std=1.05,random_state=40)
#we need to add 1 to X values (we can say its bias)
X1 = np.c_[np.ones((X.shape[0])),X]
plt.scatter(X1[:,1],X1[:,2],marker='o',c=y)
plt.axis([-5,10,-12,-1])
plt.show()
postiveX=[]
negativeX=[]
for i,v in enumerate(y):
    if v==0:
        negativeX.append(X[i])
    else:
        postiveX.append(X[i])
# data_dict maps each class label to its points; the cell defining it is not shown in this extract,
# so this construction is assumed from the postiveX/negativeX split above
data_dict = {-1: np.array(negativeX), 1: np.array(postiveX)}

max_feature_value=float('-inf')
min_feature_value=float('+inf')
for yi in data_dict:
    if np.amax(data_dict[yi])>max_feature_value:
        max_feature_value=np.amax(data_dict[yi])
    if np.amin(data_dict[yi])<min_feature_value:
        min_feature_value=np.amin(data_dict[yi])

# progressively smaller step sizes for the weight search (assumed; the defining cell is not shown)
learning_rate = [max_feature_value * 0.1,
                 max_feature_value * 0.01,
                 max_feature_value * 0.001]

def SVM_Training(data_dict):
    global w
    global b
    length_Wvector = {}                        # { ||w||: [w, b] }
    transforms = [[1,1],[-1,1],[-1,-1],[1,-1]]

    b_step_size = 2
    b_multiple = 5
    w_optimum = max_feature_value*0.5

    for lrate in learning_rate:
        w = np.array([w_optimum,w_optimum])
        optimized = False
        while not optimized:
            #b=[-maxvalue to maxvalue] we wanna maximize the b values so check for every b value
            for b in np.arange(-1*(max_feature_value*b_step_size),
                               max_feature_value*b_step_size,
                               lrate*b_multiple):
                for transformation in transforms:
                    w_t = w*transformation
                    correctly_classified = True
                    # every training point must satisfy the margin constraint yi*(w.xi + b) >= 1
                    for yi in data_dict:
                        for xi in data_dict[yi]:
                            if yi*(np.dot(w_t,xi)+b) < 1:
                                correctly_classified = False
                    if correctly_classified:
                        length_Wvector[np.linalg.norm(w_t)] = [w_t,b] #store w, b for minimum magnitude
            if w[0] < 0:
                optimized = True
            else:
                w = w - lrate
        norms = sorted(length_Wvector)          # smallest ||w|| gives the widest margin
        minimum_wlength = length_Wvector[norms[0]]
        w = minimum_wlength[0]
        b = minimum_wlength[1]
        w_optimum = w[0]+lrate*2
SVM_Training(data_dict)
colors = {1:'r',-1:'b'}
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
def visualize(data_dict):
    plt.scatter(X1[:,1],X1[:,2],marker='o',c=y)

    # hyperplane value v = x.w+b: psv = 1, nsv = -1, decision boundary = 0
    def hyperplane_value(x,w,b,v):
        return (-w[0]*x-b+v) / w[1]

    datarange = (min_feature_value*0.9,max_feature_value*1.)
    hyp_x_min = datarange[0]
    hyp_x_max = datarange[1]

    # (w.x+b) = 1
    # positive support vector hyperplane
    psv1 = hyperplane_value(hyp_x_min, w, b, 1)
    psv2 = hyperplane_value(hyp_x_max, w, b, 1)
    ax.plot([hyp_x_min,hyp_x_max],[psv1,psv2], 'k')

    # (w.x+b) = -1
    # negative support vector hyperplane
    nsv1 = hyperplane_value(hyp_x_min, w, b, -1)
    nsv2 = hyperplane_value(hyp_x_max, w, b, -1)
    ax.plot([hyp_x_min,hyp_x_max],[nsv1,nsv2], 'k')

    # (w.x+b) = 0
    # decision boundary hyperplane
    db1 = hyperplane_value(hyp_x_min, w, b, 0)
    db2 = hyperplane_value(hyp_x_max, w, b, 0)
    ax.plot([hyp_x_min,hyp_x_max],[db1,db2], 'y--')

    plt.axis([-5,10,-12,-1])
    plt.show()
visualize(data_dict)
def predict(features):
    # sign( x.w+b )
    dot_result = np.sign(np.dot(np.array(features),w)+b)
    return dot_result.astype(int)

for i in X[:5]:
    print(predict(i),end=', ')
Output: 1, 1, -1, 1, -1,
l=[]
for xi in X:
    l.append(predict(xi))
l=np.array(l).astype(int)
l
Output:
array([ 1, 1, -1, 1, -1, -1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1,
1, -1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1])
X[4]
Output:
array([-1.8171622 , -9.22909875])
for i, v in enumerate(y):
    if v==0:
        y[i]=-1
y
Output:
array([ 1, 1, -1, 1, -1, -1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1,
1, -1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1])
error = sum((l-y)**2)
error
Output: 0
DISCUSSIONS
SVMs can be used to solve various real-world problems:
SVMs are helpful in text and hypertext categorization as their application can
significantly reduce the need for labeled training instances in both the standard
inductive and transductive settings.
Classification of images can also be performed using SVMs. Experimental results
show that SVMs achieve significantly higher search accuracy than traditional query
refinement schemes after just three to four rounds of relevance feedback. This is also true
of image segmentation systems, including those using a modified version of SVM that applies
the privileged approach suggested by Vapnik.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other sciences. SVMs have
been used to classify proteins, with up to 90% of the compounds classified correctly.
Permutation tests based on SVM weights have been suggested as a mechanism for interpreting
SVM models, and support vector machine weights have also been used to interpret SVM models
in the past. Post-hoc interpretation of support vector machine models, in order to identify the
features used by the model to make predictions, is a relatively new area of research with special
significance in biology.
THEORY
The K-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm. KNN is
extremely easy to implement in its most basic form, yet it performs quite complex classification
tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase; rather, it uses
all of the data for training while classifying a new data point or instance. KNN is a non-parametric
learning algorithm, which means that it doesn't assume anything about the underlying data. This is
an extremely useful feature since most of the real world data doesn't really follow any theoretical
assumption e.g. linear-separability, uniform distribution, etc.
The intuition behind the KNN algorithm is one of the simplest of all the supervised machine learning
algorithms. It simply calculates the distance from a new data point to all training data points. The
distance can be of any type, e.g., Euclidean or Manhattan. It then selects the K nearest data points,
where K can be any integer, and finally assigns the new data point to the class to which the majority
of those K points belong.
import numpy as np
import matplotlib.pyplot as plt

# `data` (a DataFrame whose columns 0 and 1 are features and column 2 is the class label) and
# `test` (an array of query points) are assumed to be loaded in earlier cells not shown here.

def euclid(val,pt):
    d = (val[0] - pt[0])**2 + (val[1] - pt[1])**2
    d = np.sqrt(d)
    return d

K=4
ans = []
for val in test:
    dist = []
    for pt in data.values:
        d = euclid(val,pt)
        dist.insert(len(dist),np.append(d,pt[2]))
    dist.sort(key=lambda x: x[0])
    freq = {0:0, 1:0, 2:0}
    for i in range(K):
        freq[dist[i][1]] += 1
    print(freq)
    cl = sorted(freq.items(), key = lambda x: x[1], reverse = True)
    ans.insert(len(ans),cl[0][0])

plt.scatter(data[0],data[1],c=data[2],s=30)
plt.scatter(test[:,0], test[:,1],c=ans,s=70, marker = "*")
plt.show()
DISCUSSIONS
KNN can be used for classification — the output is a class membership (predicts a class — a discrete
value). An object is classified by a majority vote of its neighbors, with the object being assigned to
the class most common among its k nearest neighbors. It can also be used for regression — output
is the value for the object (predicts continuous values). This value is the average (or median) of the
values of its k nearest neighbors.
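A minimal sketch of the regression variant described above (toy 1-D data; the names are illustrative only):
import numpy as np
def knn_regress(x_train, y_train, x_query, k=3):
    # predict by averaging the targets of the k nearest training points
    distances = np.abs(x_train - x_query)  # 1-D case: absolute difference
    nearest = np.argsort(distances)[:k]    # indices of the k closest points
    return y_train[nearest].mean()
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
print(knn_regress(x_train, y_train, 2.5))  # average of the 3 nearest targets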
Pros:
No assumptions about data — useful, for example, for nonlinear data
Simple algorithm — to explain and understand/interpret
High accuracy (relatively) — it is pretty high but not competitive in comparison to better
supervised learning models
Versatile — useful for classification or regression
Cons:
Computationally expensive — because the algorithm stores all of the training data
High memory requirement
Stores all (or almost all) of the training data
Prediction stage might be slow (with big N)
Sensitive to irrelevant features and the scale of the data
THEORY
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the
data, with the number of groups represented by the variable K. The algorithm works iteratively to
assign each data point to one of K groups based on the features that are provided. Data points are
clustered based on feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data.
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze
the groups that have formed organically. Each centroid of a cluster is a collection of feature values
which define the resulting groups. Examining the centroid feature weights can be used to
qualitatively interpret what kind of group each cluster represents.
k-means clustering is a method of vector quantization, originally from signal processing, that is
popular for cluster analysis in data mining. k-means clustering aims to partition n observations into
k clusters in which each observation belongs to the cluster with the nearest mean, serving as a
prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
import numpy as np
import matplotlib.pyplot as plt

# `data` is assumed to be a two-column DataFrame of points loaded in an earlier cell (not shown)
data[0] = data[0]/max(data[0])*100   # rescale the first feature to a comparable range
data.head(10)
plt.scatter(data[0],data[1],c='g')

k = 3
# Number of training data
n = data.shape[0]
# Number of features in the data
c = data.shape[1]
# Generate random centers; we use sigma and mean so the centers span the whole data
mean = np.mean(data, axis = 0)
std = np.std(data, axis = 0)
centers = np.random.randn(k,c) * np.array(std) + np.array(mean)
centers
clusters = np.zeros(n)
distances = np.zeros((n,k))
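The assignment/update iteration itself is not shown in the extract above; a minimal sketch that would complete it, assuming data, centers, clusters and distances as defined in the fragment:
import copy
centers_old = np.zeros(centers.shape)
centers_new = copy.deepcopy(centers)
error = np.linalg.norm(centers_new - centers_old)
while error != 0:
    # assignment step: each point goes to its nearest center
    for i in range(k):
        distances[:, i] = np.linalg.norm(data.values - centers_new[i], axis=1)
    clusters = np.argmin(distances, axis=1)
    # update step: each center moves to the mean of its assigned points
    centers_old = copy.deepcopy(centers_new)
    for i in range(k):
        if np.any(clusters == i):          # leave empty clusters where they are
            centers_new[i] = np.mean(data.values[clusters == i], axis=0)
    error = np.linalg.norm(centers_new - centers_old)
plt.scatter(data[0], data[1], c=clusters, s=20)
plt.scatter(centers_new[:, 0], centers_new[:, 1], c='r', marker='*', s=200)
plt.show()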
DISCUSSIONS
The K-means clustering algorithm is used to find groups which have not been explicitly
labelled in the data. This can be used to confirm business assumptions about what types of groups
exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the
groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases
are:
Behavioural segmentation:
o Segment by purchase history
o Segment by activities on application, website, or platform
o Define personas based on interests
o Create profiles based on activity monitoring
Inventory categorization:
o Group inventory by sales activity
o Group inventory by manufacturing metrics
Sorting sensor measurements:
o Detect activity types in motion sensors
o Group images
o Separate audio
o Identify groups in health monitoring
Detecting bots or anomalies:
o Separate valid activity groups from bots
o Group valid activity to clean up outlier detection