ML Lab File

Delhi Technological University

Machine Learning Lab

Submitted By:
Shivam Gupta
2K16/CO/295
Computer Engineering
Vth semester
A4 Batch
INDEX
S. No. Name of Practical Date Sign
AIM 1
To understand basic syntax of Python programming language.

THEORY
Python is a general-purpose interpreted, interactive, object-oriented, and high-level programming
language. It was created by Guido van Rossum during 1985–1990.
Python is a high-level, interpreted, interactive and object-oriented scripting language. Python is
designed to be highly readable. It uses English keywords frequently, whereas other languages use
punctuation, and it has fewer syntactical constructions than other languages.
 Python is Interpreted − Python is processed at runtime by the interpreter. You do not need
to compile your program before executing it. This is similar to PERL and PHP.
 Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
 Python is Object-Oriented − Python supports Object-Oriented style or technique of
programming that encapsulates code within objects.
 Python is a Beginner's Language − Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from simple
text processing to web browsers to games.

SYNTAX
 Printing a string
print("Hello World!")
Output: Hello World!
 Python Indentations
Python provides no braces to indicate blocks of code for class and function definitions or
flow control. Blocks of code are denoted by line indentation, which is rigidly enforced.
Example:
if 1 < 2:
    print("One is less than two!")
Output: One is less than two!
Thus, in Python all contiguous lines indented with the same number of spaces form a
block.
 Multi-Line Statements
Statements in Python typically end with a new line. Python does, however, allow the use of
the line continuation character (\) to denote that the line should continue. For example –
total = item_one + \
        item_two + \
        item_three
Statements contained within the [], {}, or () brackets do not need to use the line
continuation character. For example −
days = ['Monday', 'Tuesday', 'Wednesday',
        'Thursday', 'Friday']

 Comments in Python
A hash sign (#) that is not inside a string literal begins a comment. All characters after the #
and up to the end of the physical line are part of the comment and the Python interpreter
ignores them.
# First comment
print "Hello, Python!" # second comment

 Docstrings

Python also has an extended documentation capability, called docstrings. Docstrings can be one
line or multiline. Python uses triple quotes at the beginning and end of the docstring:
Example:

"""This is a
multiline docstring."""

 Creating Variables
Unlike other programming languages, Python has no command for declaring a variable. A
variable is created the moment you first assign a value to it. For example –
x=5
y = "John"

 List
A list is a collection which is ordered and changeable. In Python lists are written with square
brackets.
thislist = ["apple", "banana", "cherry"]

 Tuple
A tuple is a collection which is ordered and unchangeable. In Python tuples are written with
round brackets.
thistuple = ("apple", "banana", "cherry")

 Set
A set is a collection which is unordered and unindexed. In Python sets are written with curly
brackets. Example -
thisset = {"apple", "banana", "cherry"}

 Dictionary
A dictionary is a collection which is unordered, changeable and indexed. In Python
dictionaries are written with curly brackets, and they have keys and values. Example
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}

 Python For Loops


A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a
set, or a string).
With the for loop we can execute a set of statements, once for each item in a list, tuple, set
etc. Example -
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x)
 Functions
A function is a block of code which only runs when it is called. You can pass data, known as
parameters, into a function. A function can return data as a result. Example:
def my_function():
    print("Hello from a function")
 Python Lambda
A lambda function is a small anonymous function. A lambda function can take any number
of arguments, but can only have one expression.
x = lambda a : a + 10
print(x(5))

DISCUSSION
Python is an interpreted high-level programming language for general-purpose programming.
Created by Guido van Rossum and first released in 1991, Python has a design philosophy that
emphasizes code readability, notably using significant whitespace. It provides constructs that
enable clear programming on both small and large scales.

FINDINGS AND LEARNINGS


Python has a long list of good features; a few are listed below −
 It supports functional and structured programming methods as well as OOP.
 It can be used as a scripting language or can be compiled to byte-code for building large
applications.
 It provides very high-level dynamic data types and supports dynamic type checking.
 It supports automatic garbage collection.
 It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
AIM 2
To learn and implement data representation in python.

THEORY
NumPy, which stands for Numerical Python, is a library consisting of multidimensional array objects
and a collection of routines for processing those arrays. Using NumPy, mathematical and logical
operations on arrays can be performed. It is a library consisting of multidimensional array objects
and a collection of routines for processing of array. NumPy adds support for large, multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical functions
to operate on these arrays.
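
As a small illustration of these ideas (not part of the lab code), the following sketch shows a few typical ndarray operations; the values are made up purely for demonstration.

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2 x 3 two-dimensional array
print(a.ndim, a.shape, a.dtype)        # dimensions, shape and element type
print(a * 2)                           # element-wise arithmetic
print(a.sum(axis=0))                   # column-wise sum -> [5 7 9]
print(a.T.dot(a))                      # matrix product, a 3 x 3 result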

CODE
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
data = np.genfromtxt("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv", delimiter='\t')
data = pd.DataFrame(data)
#data = pd.read_csv("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv", sep='\t', header=0)
print("First 10 rows of data are:")
print(data[:10])
print("Number of dimensions are: ", data.ndim)
print("Shape of data is: ", data.shape)
print("Rows which have any of the dimensions as NaN:")
print(data[np.isnan(data[0]) | np.isnan(data[1])])
x = data[0]
y = data[1]
plt.scatter(x,y,s=12)
plt.title("Web Traffic over last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(6)],['week %i' %w for w in range(6)])
plt.grid()
plt.show()
OUTPUT
DISCUSSION
The core functionality of NumPy is its "ndarray", for n-dimensional array, data structure. These
arrays are strided views on memory. In contrast to Python's built-in list data structure (which,
despite the name, is a dynamic array), these arrays are homogeneously typed: all elements of a
single array must be of the same type. The ndarray object consists of contiguous one-dimensional
segment of computer memory, combined with an indexing scheme that maps each item to a
location in the memory block.
Prior to Pandas, Python was mainly used for data munging and preparation; it had very little to
offer for data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical
steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate,
model, and analyze.
Python with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics, analytics, etc.
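
As a rough sketch of that load/prepare/manipulate/analyze workflow (the toy table below is hypothetical, not the lab's web traffic file):

import pandas as pd

df = pd.DataFrame({'hour': [1, 2, 3, 4], 'hits': [120, None, 150, 170]})  # load
df = df.dropna()                      # prepare: drop rows with missing values
df['hits_per_min'] = df['hits'] / 60  # manipulate: derive a new column
print(df.describe())                  # analyze: summary statistics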

FINDINGS AND LEARNINGS


NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (plotting
library). Using NumPy, a developer can perform the following operations −
 Mathematical and logical operations on arrays.
 Fourier transforms and routines for shape manipulation.
 Operations related to linear algebra. NumPy has in-built functions for linear algebra and
random number generation.
AIM 3
To learn and implement data projection.

THEORY
In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to
uniquely determine the position of the points or other geometric elements on a manifold such as
Euclidean space. All spatial data is created in some coordinate system. A projection is the means by
which you display the coordinate system and your data on a flat surface, such as a piece of paper or
a digital screen. After defining the coordinate system that matches your data, you may still find you
want to use data in a different coordinate system. This is where transformations are useful.
Transformations are required to convert data between different geographic coordinate systems or
between different vertical coordinate systems.
CODE
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
data = np.genfromtxt("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv", delimiter='\t')
#data = pd.read_csv("C:/Users/Shivam gupta/DataScience-Python3/web_traffic.tsv", sep='\t', header=0)
data = pd.DataFrame(data)
data = data[~(np.isnan(data[0]) | np.isnan(data[1]))]
x = data[0]
y = data[1]

plt.scatter(x,y,s=12)
plt.title("Web Traffic over last month (Original Data)")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(6)],['week %i' %w for w in range(6)])
#plt.autoscale(tight=True)
plt.grid()
plt.show()

plt.scatter(x,0*y,c = 'purple', s = 10)


plt.title("Data projected on X-axis")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid()
plt.show()

plt.scatter(0*x,y,c = 'red', s = 8)
plt.title("Data projected on Y-axis")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid()
plt.show()

theta = np.radians(45)   # np.cos and np.sin expect the angle in radians
rot_matrix = [[np.cos(theta),-np.sin(theta)],[np.sin(theta),np.cos(theta)]]
data1 = data.dot(rot_matrix)
x = data1[0]
y = data1[1]
plt.scatter(x,y,s=10, c = 'blue')
plt.title("Data represented in the xy-plane rotated counterclockwise through an angle of 45 degrees")
plt.grid()
plt.show()

v = np.array([[1,1]]) #Vector Coordinates


data1 = np.matmul(data,v.T.dot(v) / (v.dot(v.T)))
x = data1[:,0]
y = data1[:,1]
plt.scatter(x,y,s=12)
plt.title("Data projected on x=y line")
plt.grid()
plt.show()
OUTPUT
DISCUSSION
Numeric data can be viewed as a vector. It can be projected onto a chosen axis for better
visualization, performance, scalability, etc. It is possible to recast a matrix along other axes: the
eigenvectors of a matrix can serve as the foundation of a new set of coordinates for the same
matrix, where the eigenvector with the highest eigenvalue represents the data most effectively.
Here the web traffic data has been projected along various vectors. The projection along the X (Y)
axis is nothing but the X (Y) component of the individual data points.
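
The projection used in the code can also be checked on a single point. The sketch below (with a made-up point) builds the projection matrix v.T v / (v v.T) for the x = y line and applies it:

import numpy as np

v = np.array([[1, 1]])                 # direction of the x = y line
P = v.T.dot(v) / v.dot(v.T)            # 2 x 2 projection matrix onto span(v)
point = np.array([3.0, 1.0])
print(P.dot(point))                    # -> [2. 2.], the projection of (3, 1) onto x = y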

FINDINGS AND LEARNINGS

We learned how to project a set of data points onto a new axis. In the video gaming industry,
matrices are major mathematical tools used to construct and manipulate realistic animations of
polygonal figures. Examples of matrix operations include translations, rotations, and scaling. Other
matrix transformation concepts include field of view, rendering, color transformation and projection.
AIM 4
To implement Principal Component Analysis (PCA).

THEORY
Principal Component Analysis (PCA) is a dimension-reduction tool that can be used to reduce a
large set of variables to a small set that still contains most of the information in the large set.
Principal component analysis (PCA) is a mathematical procedure that transforms a number of
(possibly) correlated variables into a (smaller) number of uncorrelated variables called principal
components.
The first principal component accounts for as much of the variability in the data as possible, and
each succeeding component accounts for as much of the remaining variability as possible.
ALGORITHM
Step 1: Normalize the data
Step 2: Calculate the covariance matrix
Step 3: Calculate the eigenvalues and eigenvectors
Step 4: Rearrange the eigenvectors and eigenvalues in decreasing order of eigenvalues
Step 5: Compute the cumulative energy content for each eigenvector
Step 6: Based on the cumulative energy content select a subset of the eigenvectors as basis vectors
Step 7: Project the data onto the new basis.
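
Assuming scikit-learn is available, the manual eigendecomposition implemented in the CODE section below can be cross-checked with the library's PCA class; the two projections should agree up to the sign of each component. A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # 150 x 2 projection onto the top 2 components
print(pca.explained_variance_ratio_)   # share of variance kept by each component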
CODE
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
Y_pca = iris.target
X -= np.mean(X,axis = 0)
iris.target_names
np.cov(X.T)
#(X - np.mean(X,axis = 0)).T.dot(X - np.mean(X,axis = 0))/(X.shape[0]-1)
w,v = np.linalg.eig(np.cov(X.T))
# eigenvectors are the columns of v, so sort the columns by decreasing eigenvalue
idx = np.argsort(w)[::-1]
w = w[idx]
v = v[:,idx]

projected_matrix = X.dot(v[:,:2])
plt.scatter(projected_matrix[:,0][Y_pca == 0],projected_matrix[:,1][Y_pca == 0],
c = 'C0', label = iris.target_names[0])
plt.scatter(projected_matrix[:,0][Y_pca == 1],projected_matrix[:,1][Y_pca == 1],
c = 'C1', label = iris.target_names[1])
plt.scatter(projected_matrix[:,0][Y_pca == 2],projected_matrix[:,1][Y_pca == 2],
c = 'C2', label = iris.target_names[2])
plt.title('Iris Dataset Scatter Plot in 2D')
plt.xlabel('Feature')
plt.ylabel('Feature')
plt.legend()
plt.show()

projected_matrix = X.dot(v[:,:1])
plt.scatter(projected_matrix[:,0][Y_pca == 0],Y_pca[Y_pca == 0],
c = 'C0', label = iris.target_names[0])
plt.scatter(projected_matrix[:,0][Y_pca == 1],Y_pca[Y_pca == 1],
c = 'C1', label = iris.target_names[1])
plt.scatter(projected_matrix[:,0][Y_pca == 2],Y_pca[Y_pca == 2],
c = 'C2', label = iris.target_names[2])
plt.title('Iris Dataset Scatter Plot in 1D')
plt.xlabel('Features')
plt.ylabel('Targets')
plt.legend()

OUTPUT
DISCUSSION
PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is
often used to visualize genetic distance and relatedness between populations.
PCA reduces attribute space from a larger number of variables to a smaller number of factors and
as such is a "non-dependent" procedure (that is, it does not assume a dependent variable is
specified).
PCA is a dimensionality reduction or data compression method. The goal is dimension reduction
and there is no guarantee that the dimensions are interpretable (a fact often not appreciated by
(amateur) statisticians).

FINDINGS AND LEARNINGS


In quantitative finance, principal component analysis can be directly applied to the risk
management of interest rate derivatives portfolios.
A variant of principal components analysis is used in neuroscience to identify the specific properties
of a stimulus that increase a neuron's probability of generating an action potential.
PCA is predominantly used as a dimensionality reduction technique in domains like facial
recognition, computer vision and image compression. It is also used for finding patterns in data of
high dimension in the field of finance, data mining, bioinformatics, psychology, etc.
If the data is not linearly correlated (for example, data lying on a spiral), PCA alone is not enough.
EXPERIMENT NUMBER 5
AIM
To implement Decision Tree ID3 (Iterative Dichotomiser 3) Algorithm.

THEORY
Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a dataset into smaller and smaller subsets while at the same time an associated decision tree
is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision
node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy). Leaf node (e.g.,
Play) represents a classification or decision.
Entropy is a measure of the amount of uncertainty in the dataset. For a binary class with values
a/b it is defined as:
Entropy(S) = - p(a)*log2(p(a)) - p(b)*log2(p(b))
Information gain IG(A) is the measure of the difference in entropy from before to after the dataset
S is split on an attribute A. In other words, how much uncertainty in S was reduced after splitting
set S on attribute A.
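
A minimal sketch of these two quantities (the toy columns below are hypothetical, not the lab's "play ball.csv" data):

import numpy as np

def entropy(labels):
    # entropy, in bits, of a sequence of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute_values):
    # IG(A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    remainder = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

play    = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain']
print(entropy(play))                     # 1.0 bit for a 50/50 split
print(information_gain(play, outlook))   # about 0.54 bits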
ALGORITHM
ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If the set of predicting attributes is empty, then Return the single-node tree Root,
        with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root
CODE
import numpy as np
import pandas as pd
input_file = "C:/Users/Shivam gupta/DataScience-Python3/play ball.csv"
df = pd.read_csv(input_file, header = 0)
df.head(20)
features = list(df.columns[:4])
features
y = df["Play ball"]
X = df[features]
y
X
DepDict = dict()
for d in y:
    DepDict[d] = 0

ans = []

def decisionTree(data, depVar, indepVar, Y):
    #print(data)
    if len(indepVar) == 0 or len(data) == 0:
        return 1
    if len(Y.value_counts().to_dict()) == 1:
        return 1

    root = ""
    rootEnt = -np.log2(len(DepDict)) - 10
    for x in indepVar:
        values = dict()
        for d in data[x]:
            if d not in values:
                values[d] = 1
            else:
                values[d] += 1

        entropyF = 0.0
        for v in values:
            entropy = 0.0
            for d in DepDict:
                DepDict[d] = 0
            cnt = 0
            for i in range(0, len(data)):
                if data.iloc[i][x] == v:
                    DepDict[data.iloc[i][depVar]] += 1
                    cnt += 1

            for d in DepDict:
                if DepDict[d] != 0 and cnt != 0:
                    entropy -= float(DepDict[d]/cnt) * np.log2(float(DepDict[d]/cnt))

            entropyF -= float(values[v]/len(data)) * entropy

        if entropyF > rootEnt:
            rootEnt = entropyF
            root = x

    #print(indepVar)
    #print(root)
    #print(rootEnt)
    values = set()
    for d in data[root]:
        values.add(d)
    for v in values:
        #print("Now going to ", root, " value equal to ", v)
        newIndepVar = indepVar.copy()
        if root in newIndepVar:
            newIndepVar.remove(root)
        ans.append(root + " = " + v)
        flag = decisionTree(data[data[root] == v], depVar, newIndepVar, data[data[root] == v][depVar])
        if flag == 1:
            print('\nIf', ans, '\nThen', depVar, ' : ',
                  data[data[root] == v][depVar].value_counts(ascending=False).index[0], '\n\n')
        ans.pop(len(ans)-1)

    return 0

decisionTree(df, 'Play ball', features, df['Play ball'])

OUTPUT
DISCUSSION
A decision tree is a map of the possible outcomes of a series of related choices. It allows an
individual or organization to weigh possible actions against one another based on their costs,
probabilities, and benefits. They can be used either to drive informal discussion or to map out an
algorithm that predicts the best choice mathematically.
Using decision trees in machine learning has several advantages:
 The cost of using the tree to predict data decreases with each additional data point
 Works for either categorical or numerical data
 Can model problems with multiple outputs
But they also have a few disadvantages:
 When dealing with categorical data with multiple levels, the information gain is biased in
favor of the attributes with the most levels.
 Calculations can become complex when dealing with uncertainty and lots of linked
outcomes.
 Conjunctions between nodes are limited to AND, whereas decision graphs allow for nodes
linked by OR.
FINDINGS AND LEARNINGS
Advantage of ID3:
 Understandable prediction rules are created from the training data.
 Builds the fastest tree.
 Builds a short tree.
 Only need to test enough attributes until all data is classified.
Disadvantage of ID3:
 Data may be over-fitted or over-classified if a small sample is tested.
 Only one attribute at a time is tested for making a decision.
 Difficulty in classifying continuous attributes.
EXPERIMENT NUMBER 6
AIM
To implement Decision Tree CART (Classification And Regression Trees) Algorithm.
THEORY
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an
item (represented in the branches) to conclusions about the item's target value (represented in the
leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine
learning. Tree models where the target variable can take a discrete set of values are called
classification trees; in these tree structures, leaves represent class labels and branches represent
conjunctions of features that lead to those class labels. Decision trees where the target variable can
take continuous values (typically real numbers) are called regression trees.
Gini impurity is a measure of how often a randomly chosen element from the set would be
incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
The Gini impurity can be computed by summing the probability p(i) of an item with label i being
chosen times the probability (1 - p(i)) of a mistake in categorizing that item:
Gini = sum over i of p(i) * (1 - p(i)) = 1 - sum over i of p(i)^2
It reaches its minimum (zero) when all cases in the node fall into a single target category.
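
A minimal sketch of the Gini computation on made-up label lists:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_i p(i)^2 over the class labels in a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

print(gini(['Yes', 'Yes', 'Yes', 'Yes']))   # 0.0 -> pure node
print(gini(['Yes', 'Yes', 'No', 'No']))     # 0.5 -> maximally mixed for two classes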

CODE
import numpy as np
import pandas as pd
input_file = "C:/Users/Shivam gupta/DataScience-Python3/decisionTreeDataset.csv"
df = pd.read_csv(input_file, header = 0)
df.head(20)
features = list(df.columns[:3])
features
y = df["Decision"]
X = df[features]
y
X
DepDict = dict()
for d in y:
    DepDict[d] = 0

ans = []

def decisionTree(data, depVar, indepVar, Y):
    #print(data)
    if len(indepVar) == 0 or len(data) == 0:
        return 1
    if len(Y.value_counts().to_dict()) == 1:
        return 1

    root = ""
    rootGini = 2
    for x in indepVar:
        values = dict()
        for d in data[x]:
            if d not in values:
                values[d] = 1
            else:
                values[d] += 1

        avgGiniF = 0.0
        for v in values:
            gini = 1.0
            for d in DepDict:
                DepDict[d] = 0
            cnt = 0
            for i in range(0, len(data)):
                if data.iloc[i][x] == v:
                    DepDict[data.iloc[i][depVar]] += 1
                    cnt += 1

            for d in DepDict:
                if DepDict[d] != 0 and cnt != 0:
                    gini -= float(DepDict[d]/cnt) * float(DepDict[d]/cnt)

            avgGiniF += float(values[v]/len(data)) * gini

        if avgGiniF < rootGini:
            rootGini = avgGiniF
            root = x

    #print(indepVar)
    #print(root)
    #print(rootGini)
    values = set()
    for d in data[root]:
        values.add(d)
    for v in values:
        #print("Now going to ", root, " value equal to ", v)
        newIndepVar = indepVar.copy()
        if root in newIndepVar:
            newIndepVar.remove(root)
        ans.append(root + " = " + v)
        flag = decisionTree(data[data[root] == v], depVar, newIndepVar, data[data[root] == v][depVar])
        if flag == 1:
            print('\nIf', ans, '\nThen', depVar, ' : ',
                  data[data[root] == v][depVar].value_counts(ascending=False).index[0], '\n\n')
        ans.pop(len(ans)-1)

    return 0

decisionTree(df, 'Decision', features, df['Decision'])
OUTPUT

Dependent Variable “Decision” values:


Independent Variables:

Rules of the Decision Tree using CART:


DISCUSSIONS
An important feature of CART is its ability to generate regression trees. Regression trees are trees
where their leaves predict a real number and not a class. In case of regression, CART looks for splits
that minimize the prediction squared error (the least-squared deviation). The prediction in each
leaf is based on the weighted mean of the node.
When provided, CART can consider misclassification costs in the tree induction.
The main elements of CART (and any decision tree algorithm) are:
1. Rules for splitting data at a node based on the value of one variable;
2. Stopping rules for deciding when a branch is terminal and can be split no more; and
3. Finally, a prediction for the target variable in each terminal node.
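
As a hedged illustration of a regression tree (this sketch uses scikit-learn's DecisionTreeRegressor rather than the from-scratch classifier above, and a synthetic sine curve rather than the lab data):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(0, 10, 0.5).reshape(-1, 1)     # 20 one-dimensional samples
y = np.sin(X).ravel()
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[2.0], [7.5]]))           # piecewise-constant real-valued predictions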

FINDINGS AND LEARNINGS


Some useful features and advantages of CART are:
 CART is nonparametric and therefore does not rely on data belonging to a particular type of
distribution.
 CART is not significantly impacted by outliers in the input variables.
 You can relax stopping rules to "overgrow" decision trees and then prune back the tree to
the optimal size. This approach minimizes the probability that important structure in the
data set will be overlooked by stopping too soon.
 CART incorporates both testing with a test data set and cross-validation to assess the
goodness of fit more accurately.
 CART can use the same variables more than once in different parts of the tree. This
capability can uncover complex interdependencies between sets of variables.
 CART can be used in conjunction with other prediction methods to select the input set of
variables.
Some disadvantages are:
 CART may produce unstable decision trees: an insignificant modification of the learning
sample, such as eliminating several observations, can cause changes in the decision tree, e.g.
an increase or decrease of tree complexity, or changes in splitting variables and values.
 CART splits on only one variable at a time.
EXPERIMENT NUMBER 7
AIM
To implement Linear Regression.
THEORY
In statistics, linear regression is a linear approach to modelling the relationship between a scalar
response (or dependent variable) and one or more explanatory variables (or independent
variables). The case of one explanatory variable is called simple linear regression. For more than
one explanatory variable, the process is called multiple linear regression. This term is distinct from
multivariate linear regression, where multiple correlated dependent variables are predicted, rather
than a single scalar variable.
In linear regression, the relationships are modeled using linear predictor functions whose unknown
model parameters are estimated from the data. Such models are called linear models. Most
commonly, the conditional mean of the response given the values of the explanatory variables (or
predictors) is assumed to be an affine function of those values; less commonly, the conditional
median or some other quantile is used. Like all forms of regression analysis, linear regression
focuses on the conditional probability distribution of the response given the values of the
predictors, rather than on the joint probability distribution of all of these variables, which is the
domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and to be used
extensively in practical applications. This is because models which depend linearly on their
unknown parameters are easier to fit than models which are non-linearly related to their
parameters and because the statistical properties of the resulting estimators are easier to
determine.
CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("C:/Users/Shivam gupta/DataScience-Python3/Linear Regression Kaggle/train.csv", header=0)

data.head(10)
np.isnan(data).sum()
data = data[(~np.isnan(data['x'])) & (~np.isnan(data['y']))]
plt.scatter(data['x'],data['y'],s=5,c='g')

def linear_regression(X, y, m_current=0, b_current=0, epochs=1000, learning_rate=0.0001):
    N = float(len(y))
    for i in range(epochs):
        y_current = (m_current * X) + b_current
        if i == epochs-1:
            cost = sum([dat**2 for dat in (y - y_current)]) / N
        m_gradient = -(2/N) * sum(X * (y - y_current))
        b_gradient = -(2/N) * sum(y - y_current)
        m_current = m_current - (learning_rate * m_gradient)
        b_current = b_current - (learning_rate * b_gradient)
    return m_current, b_current, cost

slope, y_intercept, error = linear_regression(data['x'],data['y'])

print("The line is : y = ",slope," * x + ",y_intercept,"\nThe error is : ",error)

plt.scatter(data['x'],data['y'],s=8,c='c')
plt.plot(data['x'], slope*data['x']+y_intercept, c='m')

OUTPUT
DISCUSSIONS
The most common method for fitting a regression line is the method of least-squares. This method
calculates the best-fitting line for the observed data by minimizing the sum of the squares of the
vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its
vertical deviation is 0). Because the deviations are first squared, then summed, there are no
cancellations between positive and negative values.
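
As a cross-check of the gradient-descent fit above, the closed-form least-squares line can be obtained directly with numpy.polyfit; the toy arrays below are made up, not the Kaggle training set:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
slope, intercept = np.polyfit(x, y, deg=1)   # minimizes the sum of squared residuals
print("y =", slope, "* x +", intercept)

With enough epochs and a suitable learning rate, the gradient-descent estimates should approach these closed-form values.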

FINDINGS AND LEARNINGS


Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function that
predicts the response value(y) as accurately as possible as a function of the feature or independent
variable(x).
Applications:
1. Trend lines: A trend line represents the variation in some quantitative data with passage of time
(like GDP, oil prices, etc.). These trends usually follow a linear relationship. Hence, linear regression
can be applied to predict future values. However, this method suffers from a lack of scientific
validity in cases where other potential changes can affect the data.
2. Economics: Linear regression is the predominant empirical tool in economics. For example, it is
used to predict consumption spending, fixed investment spending, inventory investment, purchases
of a country’s exports, spending on imports, the demand to hold liquid assets, labor demand, and
labor supply.
3. Finance: The Capital Asset Pricing Model uses linear regression to analyze and quantify the
systematic risks of an investment.
4. Biology: Linear regression is used to model causal relationships between parameters in biological
systems.
EXPERIMENT NUMBER 8
AIM
To implement Support Vector Machine (SVM) in Python.
THEORY
In machine learning, support vector machines (SVMs, also support vector networks) are supervised
learning models with associated learning algorithms that analyze data used for classification and
regression analysis. Given a set of training examples, each marked as belonging to one or the other
of two categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier (although methods such
as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a
representation of the examples as points in space, mapped so that the examples of the separate
categories are divided by a clear gap that is as wide as possible. New examples are then mapped
into that same space and predicted to belong to a category based on which side of the gap they fall.

In addition to performing linear classification, SVMs can efficiently perform a non-linear
classification using what is called the kernel trick, implicitly mapping their inputs into
high-dimensional feature spaces.
When data is unlabelled, supervised learning is not possible, and an unsupervised learning
approach is required, which attempts to find natural clustering of the data to groups, and then map
new data to these formed groups. The support vector clustering algorithm, created by Hava
Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support
vector machines algorithm, to categorize unlabeled data, and is one of the most widely used
clustering algorithms in industrial applications.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or
infinite-dimensional space, which can be used for classification, regression, or other tasks like
outliers detection. Intuitively, a good separation is achieved by the hyperplane that has the largest
distance to the nearest training-data point of any class (so-called functional margin), since in
general the larger the margin the lower the generalization error of the classifier.
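
Assuming scikit-learn is installed, the from-scratch training loop in the code below can be sanity-checked against the library's SVC on the same kind of blob data; a minimal sketch:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, n_features=2, centers=2, cluster_std=1.05, random_state=40)
clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane
print(clf.support_vectors_)          # the support vectors that define the margin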
CODE and OUTPUT
In [1]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(6)
import math
In [2]:
from sklearn.datasets import make_blobs

(X,y) = make_blobs(n_samples=50,n_features=2,centers=2,cluster_std=1.05,random_state=40)
#we need to add 1 to X values (we can say its bias)
X1 = np.c_[np.ones((X.shape[0])),X]

plt.scatter(X1[:,1],X1[:,2],marker='o',c=y)
plt.axis([-5,10,-12,-1])
plt.show()

In [3]:
postiveX=[]
negativeX=[]
for i,v in enumerate(y):
    if v==0:
        negativeX.append(X[i])
    else:
        postiveX.append(X[i])

#our data dictionary
data_dict = {-1:np.array(negativeX), 1:np.array(postiveX)}
In [4]:
#all the required variables
w=[] #weights 2 dimensional vector
b=[] #bias

max_feature_value=float('-inf')
min_feature_value=float('+inf')

for yi in data_dict:
    if np.amax(data_dict[yi])>max_feature_value:
        max_feature_value=np.amax(data_dict[yi])

    if np.amin(data_dict[yi])<min_feature_value:
        min_feature_value=np.amin(data_dict[yi])

learning_rate = [max_feature_value * 0.1, max_feature_value * 0.01, max_feature_value * 0.001,]


In [5]:
def SVM_Training(data_dict):
    i=1
    global w
    global b
    # { ||w||: [w,b] }
    length_Wvector = {}
    transforms = [[1,1],[-1,1],[-1,-1],[1,-1]]

    b_step_size = 2
    b_multiple = 5
    w_optimum = max_feature_value*0.5
    for lrate in learning_rate:

        w = np.array([w_optimum,w_optimum])
        optimized = False
        while not optimized:
            #b=[-maxvalue to maxvalue] we wanna maximize the b values so check for every b value
            for b in np.arange(-1*(max_feature_value*b_step_size), max_feature_value*b_step_size,
                               lrate*b_multiple):
                for transformation in transforms: # transforms = [[1,1],[-1,1],[-1,-1],[1,-1]]
                    w_t = w*transformation

                    correctly_classified = True

                    # every data point should be correct
                    for yi in data_dict:
                        for xi in data_dict[yi]:
                            if yi*(np.dot(w_t,xi)+b) < 1: # we want yi*(np.dot(w_t,xi)+b) >= 1 for correct classification
                                correctly_classified = False

                    if correctly_classified:
                        length_Wvector[np.linalg.norm(w_t)] = [w_t,b] #store w, b for minimum magnitude

            if w[0] < 0:
                optimized = True
            else:
                w = w - lrate

        norms = sorted([n for n in length_Wvector])

        minimum_wlength = length_Wvector[norms[0]]
        w = minimum_wlength[0]
        b = minimum_wlength[1]

        w_optimum = w[0]+lrate*2
In [6]:
SVM_Training(data_dict)
In [7]:
colors = {1:'r',-1:'b'}
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
In [8]:
def visualize(data_dict):

    #[[ax.scatter(x[0],x[1],s=100,color=colors[i]) for x in data_dict[i]] for i in data_dict]

    plt.scatter(X1[:,1],X1[:,2],marker='o',c=y)

    # hyperplane = x.w+b
    # v = x.w+b
    # psv = 1
    # nsv = -1
    # dec = 0
    def hyperplane_value(x,w,b,v):
        return (-w[0]*x-b+v) / w[1]

    datarange = (min_feature_value*0.9,max_feature_value*1.)
    hyp_x_min = datarange[0]
    hyp_x_max = datarange[1]

    # (w.x+b) = 1
    # positive support vector hyperplane
    psv1 = hyperplane_value(hyp_x_min, w, b, 1)
    psv2 = hyperplane_value(hyp_x_max, w, b, 1)
    ax.plot([hyp_x_min,hyp_x_max],[psv1,psv2], 'k')

    # (w.x+b) = -1
    # negative support vector hyperplane
    nsv1 = hyperplane_value(hyp_x_min, w, b, -1)
    nsv2 = hyperplane_value(hyp_x_max, w, b, -1)
    ax.plot([hyp_x_min,hyp_x_max],[nsv1,nsv2], 'k')

    # (w.x+b) = 0
    # decision boundary
    db1 = hyperplane_value(hyp_x_min, w, b, 0)
    db2 = hyperplane_value(hyp_x_max, w, b, 0)
    ax.plot([hyp_x_min,hyp_x_max],[db1,db2], 'y--')

    plt.axis([-5,10,-12,-1])
    plt.show()
In [9]:
visualize(data_dict)

In [10]:
def predict(features):
    # sign( x.w+b )
    dot_result = np.sign(np.dot(np.array(features),w)+b)
    return dot_result.astype(int)

for i in X[:5]:
    print(predict(i),end=', ')
1, 1, -1, 1, -1,
In [11]:
l=[]
for xi in X:
    l.append(predict(xi[:6]))
l=np.array(l).astype(int)
l
Out[11]:
array([ 1, 1, -1, 1, -1, -1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1,
1, -1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1])
In [12]:
X[4]
Out[12]:
array([-1.8171622 , -9.22909875])
In [13]:
for i, v in enumerate(y):
    if v==0:
        y[i]=-1
y
Out[13]:
array([ 1, 1, -1, 1, -1, -1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1,
1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1,
1, -1, 1, -1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1])
In [14]:
error = sum((l-y)**2)
In [15]:
error
Out[15]:
0

DISCUSSIONS
SVMs can be used to solve various real-world problems:
SVMs are helpful in text and hypertext categorization as their application can
significantly reduce the need for labeled training instances in both the standard
inductive and transductive settings.
Classification of images can also be performed using SVMs. Experimental results
show that SVMs achieve significantly higher search accuracy than traditional query
refinement schemes after just three to four rounds of relevance feedback. This is also true
of image segmentation systems, including those using a modified version of SVM that uses
the privileged approach as suggested by Vapnik.
Hand-written characters can be recognized using SVM.
The SVM algorithm has been widely applied in the biological and other sciences. SVMs have
been used to classify proteins with up to 90% of the compounds classified correctly.
Permutation tests based on SVM weights have been suggested as a mechanism for the
interpretation of SVM models. Support vector machine weights have also been used to
interpret SVM models in the past. Post hoc interpretation of support vector machine models,
in order to identify the features used by the model to make predictions, is a relatively new
area of research with special significance in biology.

FINDINGS AND LEARNINGS


Advantages
SVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes
algorithm. They also use less memory because they use a subset of training points in the
decision phase. SVM works well with a clear margin of separation and with high dimensional
space.
Disadvantages
SVM is not suitable for large datasets because of its high training time and it also takes more time in
training compared to Naïve Bayes. It works poorly with overlapping classes and is also sensitive to
the type of kernel used.
EXPERIMENT NUMBER 9
AIM
To implement the K Nearest Neighbours classification algorithm.

THEORY
The K-nearest neighbors (KNN) algorithm is a type of supervised machine learning algorithm. KNN
is extremely easy to implement in its most basic form, and yet performs quite complex classification
tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses
all of the data for training while classifying a new data point or instance. KNN is a non-parametric
learning algorithm, which means that it doesn't assume anything about the underlying data. This is
an extremely useful feature since most of the real world data doesn't really follow any theoretical
assumption e.g. linear-separability, uniform distribution, etc.
The intuition behind the KNN algorithm is one of the simplest of all the supervised machine learning
algorithms. It simply calculates the distance of a new data point to all other training data points.
The distance can be of any type e.g Euclidean or Manhattan etc. It then selects the K-nearest data
points, where K can be any integer. Finally it assigns the data point to the class to which the
majority of the K data points belong.
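
A hedged library-based sketch of the same idea (this assumes scikit-learn and the Iris data; the from-scratch version in the code below computes the Euclidean distances explicitly):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=4)   # K = 4, as in the code below
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))            # classification accuracy on held-out data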

CODE and OUTPUT


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("C:/Users/Shivam gupta/DataScience-Python3/KNN_Data.csv",header=None)
data.head(10)
data[0] = data[0]/max(data[0])*100
data.head(10)
test = np.array([ [45,45],[45,35], [45,30],[35,40], [70,32], [70,34], [80,24], [75,30]])
plt.scatter(data[0],data[1],c=data[2],s=30)
plt.scatter(test[:,0], test[:,1],c='r',s=70, marker = "*")
plt.title("Red Data Points (stars) are unclassified.")
plt.show()

def euclid(val,pt):
    d = (val[0] - pt[0])**2 + (val[1] - pt[1])**2
    d = np.sqrt(d)
    return d

K=4
ans = []
for val in test:
    dist = []
    for pt in data.values:
        d = euclid(val,pt)
        dist.insert(len(dist),np.append(d,pt[2]))

    dist.sort(key=lambda x: x[0])
    freq = {0:0, 1:0, 2:0}
    for i in range(K):
        freq[dist[i][1]] += 1
    print(freq)
    cl = sorted(freq.items(), key = lambda x: x[1], reverse = True)
    ans.insert(len(ans),cl[0][0])

plt.scatter(data[0],data[1],c=data[2],s=30)
plt.scatter(test[:,0], test[:,1],c=ans,s=70, marker = "*")
plt.show()

DISCUSSIONS
KNN can be used for classification — the output is a class membership (predicts a class — a discrete
value). An object is classified by a majority vote of its neighbors, with the object being assigned to
the class most common among its k nearest neighbors. It can also be used for regression — output
is the value for the object (predicts continuous values). This value is the average (or median) of the
values of its k nearest neighbors.
Pros:
No assumptions about data — useful, for example, for nonlinear data
Simple algorithm — to explain and understand/interpret
High accuracy (relatively) — it is pretty high but not competitive in comparison to better
supervised learning models
Versatile — useful for classification or regression
Cons:
Computationally expensive — because the algorithm stores all of the training data
High memory requirement
Stores all (or almost all) of the training data
Prediction stage might be slow (with big N)
Sensitive to irrelevant features and the scale of the data

FINDINGS AND LEARNINGS


The algorithm can be summarized as:
A positive integer k is specified, along with a new sample
We select the k entries in our database which are closest to the new sample
We find the most common classification of these entries
This is the classification we give to the new sample
A few other features of KNN:
KNN stores the entire training dataset which it uses as its representation.
KNN does not learn any model.
KNN makes predictions just-in-time by calculating the similarity between an input sample
and each training instance.
EXPERIMENT NUMBER 10
AIM
To implement the K Means Clustering algorithm.

THEORY
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the
data, with the number of groups represented by the variable K. The algorithm works iteratively to
assign each data point to one of K groups based on the features that are provided. Data points are
clustered based on feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data.
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze
the groups that have formed organically. Each centroid of a cluster is a collection of feature values
which define the resulting groups. Examining the centroid feature weights can be used to
qualitatively interpret what kind of group each cluster represents.
k-means clustering is a method of vector quantization, originally from signal processing, that is
popular for cluster analysis in data mining. k-means clustering aims to partition n observations into
k clusters in which each observation belongs to the cluster with the nearest mean, serving as a
prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
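
A minimal library-based sketch (this assumes scikit-learn and uses synthetic blobs, not the lab's KMeans_Data.csv); the code below implements the same assign-then-recompute iteration by hand:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
points = np.vstack([rng.randn(50, 2) + [0, 0],
                    rng.randn(50, 2) + [5, 5],
                    rng.randn(50, 2) + [0, 5]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)      # the 3 centroids
print(km.labels_[:10])          # cluster index of the first 10 points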

CODE and OUTPUT


import random
from copy import deepcopy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("C:/Users/Shivam gupta/DataScience-Python3/KMeans_Data.csv", header=None)
data.head(10)
data.shape

data[0] = data[0]/max(data[0])*100
data.head(10)

plt.scatter(data[0],data[1],c='g')
k=3
# Number of training data
n = data.shape[0]
# Number of features in the data
c = data.shape[1]

# Generate random centers, here we use sigma and mean to ensure it represent the whole data
mean = np.mean(data, axis = 0)
std = np.std(data, axis = 0)
centers = np.random.randn(k,c) * np.array(std) + np.array(mean)
centers

plt.scatter(data[0], data[1], s=7)


plt.scatter(centers[:,0], centers[:,1], marker='*', c='g', s=150)
centers_old = np.zeros(centers.shape) # to store old centers
centers_new = deepcopy(centers) # Store new centers

clusters = np.zeros(n)
distances = np.zeros((n,k))

error = np.linalg.norm(centers_new - centers_old)


# When, after an update, the estimate of that center stays the same, exit loop
while error != 0:
    # Measure the distance to every center
    # (use centers_new so each iteration refines the most recent estimates)
    for i in range(k):
        distances[:,i] = np.linalg.norm(data - centers_new[i], axis=1)
    # Assign all training data to closest center
    clusters = np.argmin(distances, axis = 1)
    centers_old = deepcopy(centers_new)
    # Calculate mean for every cluster and update the center
    for i in range(k):
        centers_new[i] = np.mean(data[clusters == i], axis=0)
    error = np.linalg.norm(centers_new - centers_old)

centers_new
plt.scatter(data[0], data[1],c=clusters, s=7)
plt.scatter(centers_new[:,0], centers_new[:,1], marker='*', c='g', s=150)

DISCUSSIONS
The K-means clustering algorithm is used to find groups which have not been explicitly
labelled in the data. This can be used to confirm business assumptions about what types of groups
exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the
groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases
are:
Behavioural segmentation:
o Segment by purchase history
o Segment by activities on application, website, or platform
o Define personas based on interests
o Create profiles based on activity monitoring
Inventory categorization:
o Group inventory by sales activity
o Group inventory by manufacturing metrics
Sorting sensor measurements:
o Detect activity types in motion sensors
o Group images
o Separate audio
o Identify groups in health monitoring
Detecting bots or anomalies:
o Separate valid activity groups from bots
o Group valid activity to clean up outlier detection

FINDINGS AND LEARNINGS


The k-means algorithm has the following advantages and disadvantages:
Advantages
Fast, robust and easy to understand.
Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the
number of dimensions of each object, and t the number of iterations. Normally, k, t, d << n.
Gives the best results when the clusters in the data set are distinct or well separated from each other.
Disadvantages
The learning algorithm requires a priori specification of the number of cluster centers.
Exclusive assignment: if two clusters overlap heavily, k-means will not be able to resolve
that there are two clusters.
The learning algorithm is not invariant to non-linear transformations, i.e. with a different
representation of the data we get different results.
Euclidean distance measures can unequally weight underlying factors.
The learning algorithm provides only a local optimum of the squared error function.
A poor random choice of the initial cluster centers may not lead to a fruitful result.
Applicable only when the mean is defined, i.e. it fails for categorical data.
Unable to handle noisy data and outliers.
The algorithm fails for non-linear data sets.
