Data Mining Using Python

Vision of the Department

The Computer Science & Engineering Department aims at providing a continuously stimulating educational environment to its students for attaining their professional goals and meeting the global challenges.
Mission of the Department
• DM1: To develop a strong theoretical and practical background across the computer science discipline with an emphasis on problem solving.
• DM2: To inculcate professional behaviour with strong ethical values, leadership qualities, innovative thinking and analytical abilities into the student.
• DM3: Expose the students to cutting-edge technologies which enhance their employability and knowledge.
• DM4: Facilitate the faculty to keep track of the latest developments in their research areas and encourage the faculty to foster healthy interaction with industry.
Program Educational Objectives (PEOs)
• PEO1: Pursue higher education, entrepreneurship and research to compete at the global level.
• PEO2: Design and develop products innovatively in computer science and engineering and in other allied fields.
• PEO3: Function effectively as individuals and as members of a team in the conduct of interdisciplinary projects, at all levels, with ethics and the necessary attitude.
• PEO4: Serve the ever-changing needs of society with a pragmatic perception.

PROGRAMME OUTCOMES (POs):

PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as, being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.

PROGRAMME SPECIFIC OUTCOMES (PSOs):

PSO1: The ability to apply Software Engineering practices and strategies in software project development using an open-source programming environment for the success of the organization.
PSO2: The ability to design and develop computer programs in networking, web applications and IoT as per the society's needs.
PSO3: To inculcate an ability to analyze, design and implement database applications.

AIM: Demonstrating the following data preprocessing tasks using Python libraries

Description/Theory:
Pre-processing refers to the transformations applied to the data before feeding it to the algorithm. Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it is collected in a raw format, which is not feasible for analysis.
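As a quick illustration of the raw-to-clean idea, a minimal pandas sketch that drops duplicate rows and fills missing numeric values; the column names and values here are hypothetical:

import pandas as pd

# Hypothetical raw table; the column names are assumed for illustration only
raw = pd.DataFrame({'Age': [19, 19, None, 20, 19],
                    'Marks': [28, 28, 24, None, 26]})
clean = raw.drop_duplicates()                        # drop exact duplicate rows
clean = clean.fillna(clean.mean(numeric_only=True))  # fill NaNs with column means
print(clean)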

PROCEDURE/CODE:
a) Loading the dataset
import numpy as np
import pandas as pd

dataset = pd.read_csv('D:/DM/file1.csv')
print(dataset)
x = dataset.iloc[:, :-1].values
print(x)

b) Identifying the dependent and independent variables
dataset = pd.read_csv('D:/DM/file2.csv')
print(dataset)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(x)
print(y)
x = dataset.iloc[1:, 1:-1].values
y = dataset.iloc[:, [0, 4]].values
z = dataset.iloc[:, 0].values
print(x)
print(y)
print(z)

c) Dealing with missing data
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
print(x)
imputer = imputer.fit(x[:, 1:])
x[:, 1:] = imputer.transform(x[:, 1:])
print(x)

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
x = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(x))

OUTPUT:
a)  Roll    Name  Age  Branch  Marks1  Marks2  Marks3
0      1  Ramesh   19     CSE      28      27      25
1      2  Suresh   19      IT      24      29      26
2      3  Geetha   18     EEE      19      28      27
3      4  Seetha   19     ECE      26      19      28
4      5  Sheela   19    MECH      27      18      29
5      6     Ram   20     CSE      20      27      20
6      7    Ravi   19      IT      21      26      30
7      8    Mani   19     ECE      30      25      19
8      9   Srinu   20     EEE      28      24      18
9     10    Devi   19     CSE      18      23      23
[[1 'Ramesh' 19 'CSE' 28 27]
 [2 'Suresh' 19 'IT' 24 29]
 [3 'Geetha' 18 'EEE' 19 28]
 [4 'Seetha' 19 'ECE' 26 19]
 [5 'Sheela' 19 'MECH' 27 18]
 [6 'Ram' 20 'CSE' 20 27]
 [7 'Ravi' 19 'IT' 21 26]
 [8 'Mani' 19 'ECE' 30 25]
 [9 'Srinu' 20 'EEE' 28 24]
 [10 'Devi' 19 'CSE' 18 23]]

b)   Country  Gender   Age   Salary Purchased
0     France    Male  44.0  72000.0        No
1      Spain  Female  27.0  48000.0       Yes
2    Germany    Male  30.0  54000.0        No
3      Spain    Male  38.0  61000.0        No
4    Germany  Female  40.0      NaN       Yes
5     France  Female  35.0  58000.0       Yes
6      Spain  Female   NaN  52000.0        No
7     France    Male  48.0  79000.0       Yes
8    Germany    Male  50.0  83000.0        No
9     France    Male  37.0  69000.0       Yes
[['France' 'Male' 44.0 72000.0]
 ['Spain' 'Female' 27.0 48000.0]
 ['Germany' 'Male' 30.0 54000.0]

c) [['Female' 27.0 48000.0]
 ['Male' 30.0 54000.0]
 ['Male' 38.0 61000.0]
 ['Female' 40.0 63000.0]
 ['Female' 35.0 58000.0]
 ['Female' 38.125 52000.0]
 ['Male' 48.0 79000.0]
 ['Male' 50.0 83000.0]
 ['Male' 37.0 69000.0]]

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]

DATE:

AIM: Demonstrate the following data preprocessing tasks using Python libraries

Description/Theory:
Categorical data is a type of data that is used to group information with similar characteristics, while numerical data is a type of data that expresses information in the form of numbers.
Example of categorical data: gender.
Categorical variables can be divided into two categories (see the sketch below):
Nominal: no particular order
Ordinal: there is some order between values
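A minimal pandas sketch of the two cases (the category values are hypothetical): a nominal column is one-hot encoded, while an ordinal column gets integer codes that respect its stated order.

import pandas as pd

df = pd.DataFrame({'branch': ['CSE', 'IT', 'ECE'],       # nominal: no order
                   'grade': ['low', 'high', 'medium']})  # ordinal: ordered

# Nominal -> one-hot columns; no order is implied between branches
print(pd.get_dummies(df['branch']))

# Ordinal -> integer codes that respect the order low < medium < high
order = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
print(df['grade'].astype(order).cat.codes)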
PROCEDURE/CODE:

a) Dealing with categorical data
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
print(x)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

b) Scaling the features
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train[:, 6:] = sc.fit_transform(x_train[:, 6:])
x_test[:, 6:] = sc.transform(x_test[:, 6:])
print(x_train)
print(x_test)

c) Splitting the dataset into Training and Testing Sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
print(x_train)
print(x_test)
print(y_train)
print(y_test)

OUTPUT:
a) [[0.0 1.0 0.0 0.0 'Male' 44.0 72000.0]
 [1.0 0.0 0.0 1.0 'Female' 27.0 48000.0]
 [1.0 0.0 1.0 0.0 'Male' 30.0 54000.0]
 [1.0 0.0 0.0 1.0 'Male' 38.0 61000.0]
 [1.0 0.0 1.0 0.0 'Female' 40.0 nan]
 [0.0 1.0 0.0 0.0 'Female' 35.0 58000.0]
 [1.0 0.0 0.0 1.0 'Female' nan 52000.0]
 [0.0 1.0 0.0 0.0 'Male' 48.0 79000.0]
 [1.0 0.0 1.0 0.0 'Male' 50.0 83000.0]
 [0.0 1.0 0.0 0.0 'Male' 37.0 69000.0]]

[0 1 0 0 1 1 0 1 0 1]

b) [[1.0 0.0 0.0 1.0 'Female' nan -1.0182239953527132]
 [1.0 0.0 1.0 0.0 'Female' 40.0 nan]
 [0.0 1.0 0.0 0.0 'Male' 44.0 0.5834766714942513]
 [1.0 0.0 0.0 1.0 'Male' 38.0 -0.2974586952715791]
 [1.0 0.0 0.0 1.0 'Female' 27.0 -1.3385641287221062]
 [0.0 1.0 0.0 0.0 'Male' 48.0 1.1440719048906889]
 [1.0 0.0 1.0 0.0 'Male' 50.0 1.4644120382600818]
 [0.0 1.0 0.0 0.0 'Female' 35.0 -0.5377137952986238]]
[[1.0 0.0 1.0 0.0 'Male' 30.0 -0.8580539286680167]
 [0.0 1.0 0.0 0.0 'Male' 37.0 0.3432215714672067]]

c) [[1.0 0.0 0.0 1.0 'Female' nan 52000.0]
 [1.0 0.0 1.0 0.0 'Female' 40.0 nan]
 [0.0 1.0 0.0 0.0 'Male' 44.0 72000.0]
 [1.0 0.0 0.0 1.0 'Male' 38.0 61000.0]
 [1.0 0.0 0.0 1.0 'Female' 27.0 48000.0]
 [0.0 1.0 0.0 0.0 'Male' 48.0 79000.0]
 [1.0 0.0 1.0 0.0 'Male' 50.0 83000.0]
 [0.0 1.0 0.0 0.0 'Female' 35.0 58000.0]]
[[1.0 0.0 1.0 0.0 'Male' 30.0 54000.0]
 [0.0 1.0 0.0 0.0 'Male' 37.0 69000.0]]
[0 1 0 0 1 1 0 1]
[0 1]

DATE:

AIM: Demonstrate the following Similarity and Dissimilarity Measures using Python

a) Pearson's Correlation
b) Cosine Similarity
c) Jaccard Similarity
d) Euclidean Distance
e) Manhattan Distance
Description/Theory:
The Pearson correlation coefficient, often referred to as Pearson's r, is a measure of linear correlation between two variables. This means that the Pearson correlation coefficient measures a normalized measurement of covariance (i.e., a value between -1 and 1 that shows how much the variables vary together).

Cosine similarity is a metric used to measure the similarity of two vectors. Specifically, it measures the similarity in the direction or orientation of the vectors, ignoring differences in their magnitude or scale.

The Jaccard similarity measures the similarity between two sets of data to see which members are shared and distinct. The Jaccard similarity is calculated by dividing the number of observations in both sets by the number of observations in either set.

Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary distance between two points. It is one of the most used metrics in cluster analysis; one of the algorithms that uses this formula is K-means. Mathematically, it computes the root of the squared differences between the coordinates of two objects.

Manhattan distance determines the absolute difference between the pairs of coordinates. Suppose we have two points P and Q; to determine the distance between these points we simply calculate the perpendicular distances of the points from the X-axis and Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|
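Each of the formulas above can be checked directly with NumPy before reaching for the library helpers used below; a minimal sketch, reusing the same vectors as the examples that follow:

import numpy as np

a = np.array([2, 1, 2, 3, 2, 9], dtype=float)
b = np.array([3, 4, 2, 4, 5, 5], dtype=float)

# Pearson's r: normalized covariance of a and b
pearson = np.sum((a - a.mean()) * (b - b.mean())) / (
    np.sqrt(np.sum((a - a.mean()) ** 2)) * np.sqrt(np.sum((b - b.mean()) ** 2)))

# Cosine similarity: dot product divided by the product of the norms
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard similarity on binary vectors: |intersection| / |union|
p = np.array([1, 0, 0, 1, 1, 1])
q = np.array([0, 0, 1, 1, 1, 1])
jaccard = np.sum((p == 1) & (q == 1)) / np.sum((p == 1) | (q == 1))

# Euclidean distance: root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))

print(pearson, cosine, jaccard, euclidean, manhattan)

Note that this computes the Jaccard similarity (3/5 = 0.6 for p and q), while the scipy helper used in part c) below reports the complementary distance 1 - 0.6 = 0.4.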
PROCEDURE/CODE:

a) Pearson's Correlation
# Pearson's Correlation
from scipy.stats import pearsonr

X = [-2, -1, 0, 1, 2]
Y = [4, 1, 3, 2, 0]
corr = pearsonr(X, Y)
print('Pearson Correlation : ', corr)

b) Cosine Similarity
import numpy as np
from numpy.linalg import norm

# define two lists or arrays
a = np.array([2, 1, 2, 3, 2, 9])
b = np.array([3, 4, 2, 4, 5, 5])
print("a : ", a)
print("b : ", b)

# compute cosine similarity
cosine = np.dot(a, b) / (norm(a) * norm(b))
print("Cosine Similarity : ", cosine)

c) Jaccard Similarity
import numpy as np
from scipy.spatial.distance import jaccard

a = np.array([1, 0, 0, 1, 1, 1])
b = np.array([0, 0, 1, 1, 1, 1])
d = jaccard(a, b)  # scipy returns the Jaccard distance (1 - similarity)
print("Distance : ", d)

d) Euclidean Distance
from sklearn.metrics.pairwise import euclidean_distances

X = [[0, 1], [1, 1]]
euclidean_distances(X, X)

# Calculating euclidean distance between vectors
from scipy.spatial.distance import euclidean
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
dist = euclidean(row1, row2)
print(dist)

Minkowski Distance (p=1 gives the Manhattan distance of part e):
from scipy.spatial import minkowski_distance

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# Calculate distance (p=1, Manhattan)
dist = minkowski_distance(row1, row2, 1)
print(dist)
# Calculate distance (p=2, Euclidean)
dist = minkowski_distance(row1, row2, 2)
print(dist)

# Pearson's r without using library methods
import math

n = int(input('Enter number of rows : '))
x, y = [], []
for i in range(n):
    print('Enter ' + str(i + 1) + ' row x value : ', end="")
    x.append(int(input()))
    print('Enter ' + str(i + 1) + ' row y value : ', end="")
    y.append(int(input()))
print("Entered values : ")
print(x)
print(y)
x2 = []  # list for storing x^2
y2 = []  # list for storing y^2
xy = []  # list for storing x*y
for i in range(n):
    x2.append(math.pow(x[i], 2))
    y2.append(math.pow(y[i], 2))
    xy.append(x[i] * y[i])
print('x square : ', x2)
print('y square : ', y2)
print('xy value : ', xy)
# true division is needed here; integer division (//) distorts the result
rxy = sum(xy) / n - ((sum(x) / n) * (sum(y) / n))        # covariance term
rx = math.sqrt(sum(x2) / n - math.pow(sum(x) / n, 2))    # std dev of x
ry = math.sqrt(sum(y2) / n - math.pow(sum(y) / n, 2))    # std dev of y
result = rxy / (rx * ry)
print("Pearson's correlation coefficient : ", result)

OUTPUT:
a) Pearson Correlation :  PearsonRResult(statistic=-0.7000000000000001, pvalue=0.18812040437418737)

b) a :  [2 1 2 3 2 9]
   b :  [3 4 2 4 5 5]
   Cosine Similarity :  0.8188504723485274

c) Distance :  0.4

d) array([[0., 1.],
          [1., 0.]])
   6.082762530298219

e) 10

Minkowski distance:
13.0
6.082762530298219

Pearson:
Enter number of rows : 5
Enter 1 row x value : -2
Enter 1 row y value : 4
Enter 2 row x value : -1
Enter 2 row y value : 1
Enter 3 row x value : 0
Enter 3 row y value : 3
Enter 4 row x value : 1
Enter 4 row y value : 2
Enter 5 row x value : 2
Enter 5 row y value : 0
Entered values :
[-2, -1, 0, 1, 2]
[4, 1, 3, 2, 0]
x square :  [4.0, 1.0, 0.0, 1.0, 4.0]
y square :  [16.0, 1.0, 9.0, 4.0, 0.0]
xy value :  [-8, -1, 0, 2, 0]
Pearson's correlation coefficient :  -0.6999999999999998

DATE:

AIM: Build a model using linear regression algorithm on any dataset.
Description/Theory:
Linear regression is the type of regression that forms a relationship between the target variable and one or more independent variables using a straight line. The following equation represents the linear regression model, where a is the intercept, b is the slope and e is the error term:
Y = a + b*X + e
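The coefficients are found by least squares; as a quick illustration, the closed-form estimates of a and b can be computed directly with NumPy (the data points here are made up):

import numpy as np

# Closed-form least-squares estimates for Y = a + b*X + e (illustrative data)
X = np.array([5.0, 15.0, 25.0, 35.0, 45.0])
Y = np.array([5.0, 20.0, 14.0, 32.0, 22.0])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
print("intercept a :", a, " slope b :", b)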
PROCEDURE/CODE:
# Building a linear regression model for the diabetes dataset
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes_X, diabetes_Y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the dependent variable
diabetes_y_train = diabetes_Y[:-20]
diabetes_y_test = diabetes_Y[-20:]

# Create an object for LinearRegression
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions on the test set
diabetes_y_pred = regr.predict(diabetes_X_test)
print(diabetes_y_pred)

Plotting:
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=2)
plt.show()

Grid:
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=2, marker='*', markerfacecolor='red')
plt.grid()
plt.show()
Linear Regression:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 45, 25, 35, 45]).reshape((-1, 1))
y = np.array([5, 21, 14, 22, 32, 28])
print(x, y)

model = LinearRegression()
model.fit(x, y)
result = model.score(x, y)
print("Score : ", result)
print("Intercept : ", model.intercept_)
print("Slope : ", model.coef_)
y_pred = model.predict(x)  # y values predicted by the model
print("Actual values of y : ", y)
print("Predicted values of y : ", y_pred)

plt.plot(x, y_pred, color='blue', linewidth=2, marker='o', markerfacecolor='red')
plt.grid()
plt.show()

Multiple Linear Regression:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('D:/DM/dataset.csv')
dataset.describe()
shape = dataset.shape
print(shape)
dataset.columns
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)

# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
pred_df = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': y_pred, 'Difference': y_test - y_pred})
pred_df

OUTPUT:
[225.9732401  115.74763374 163.27610621 114.73638965 120.80385422
 158.21988574 236.08568105 121.81509832  99.56772822 123.83758651
 204.7371141   96.53399594 154.17490936 130.91629517  83.3878227
 171.36605897 137.99500384 137.99500384 189.56845268  84.3990668 ]
Plotting: [scatter plot of the test data with the fitted regression line]

Grid: [the same plot with grid lines and star markers]

[[ 5]
 [15]
 [45]
 [25]
 [35]
 [45]] [ 5 21 14 22 32 28]
Score : 0.3114260563380282
Intercept : 10.912499999999996
Slope : [0.3325]
Actual values of y :  [ 5 21 14 22 32 28]
Predicted values of y :  [12.575 15.9   25.875 19.225 22.55  25.875]

[scatter plot of x and y with the fitted line]

Multiple Linear Regression:
(50, 5)
[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']

Actual Value   Predicted Value   Difference
103282.38      103015.201598     267.178402
144259.40      132582.277608     11677.122392
146121.95      132447.738452     13674.211548

DATE:

AIM: Build a classification model using Decision Tree algorithm on the iris dataset

Description/Theory:
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
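By default, scikit-learn's DecisionTreeClassifier chooses splits with the Gini impurity criterion; a minimal sketch of how a candidate split is scored (the branch labels here are hypothetical):

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

left = np.array([0, 0, 0, 1])       # hypothetical labels in the left branch
right = np.array([1, 1, 2, 2, 2])   # hypothetical labels in the right branch
n = len(left) + len(right)
weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print(weighted)  # the split with the lowest weighted impurity is preferred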

PROCEDURE/CODE:
# import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)

# Create an instance of the DecisionTreeClassifier class
tree_clf = DecisionTreeClassifier(max_depth=3)

# Fit the model on the training data
tree_clf.fit(X_train, y_train)

# Predict on the testing data
y_pred = tree_clf.predict(X_test)

# Calculate accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy : ', accuracy)

# Visualize the decision tree using the plot_tree function
plt.figure(figsize=(15, 10))
plot_tree(tree_clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

OUTPUT:
Accuracy : 0.9555555555555556
[decision tree visualization produced by plot_tree]

DATE:

AIM: Apply Naive Bayes classification algorithm on any dataset

Description/Theory:
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training dataset.
Some popular examples of the Naïve Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.
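Underneath, GaussianNB applies Bayes' theorem with a Gaussian likelihood per feature; a minimal one-feature sketch with assumed class statistics (all numbers here are illustrative):

import numpy as np

def gaussian(x, mean, var):
    # Gaussian probability density used as the class-conditional likelihood
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

prior = {'A': 0.5, 'B': 0.5}   # class priors (assumed)
mean = {'A': 1.0, 'B': 3.0}    # per-class feature mean (assumed)
var = {'A': 0.5, 'B': 0.5}     # per-class feature variance (assumed)

x = 1.4  # new observation
posterior = {c: prior[c] * gaussian(x, mean[c], var[c]) for c in prior}
total = sum(posterior.values())
print({c: p / total for c, p in posterior.items()})  # normalized posteriors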

PROCEDURE/CODE:
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('D:/DM/iris.csv')
print(dataset)

X = dataset.iloc[:, :4].values
Y = dataset['Species'].values
print(Y)
print(X)

# split the dataset into training and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

# create an object for the GaussianNB classifier
classifier = GaussianNB()
classifier.fit(X_train, Y_train)

# predict the values
print(X_test[0])
y_pred = classifier.predict(X_test)
print(y_pred)
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy : ", accuracy)

# build confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, y_pred)
from sklearn.metrics import accuracy_score
print("Accuracy : ", accuracy_score(Y_test, y_pred))
cm

df = pd.DataFrame({'Real Values': Y_test, 'Predicted Values': y_pred})
print(df)

Naive Bayes:
# predict the class label for a new observation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

# import iris dataset
iris = load_iris()
X = iris.data
Y = iris.target
le = LabelEncoder()
Y = le.fit_transform(Y)

# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Train a Naive Bayes model on the training data
nb_model = GaussianNB()
nb_model.fit(X_train, Y_train)

# Make predictions on the test data
y_pred = nb_model.predict(X_test)
y_pred = le.inverse_transform(y_pred)
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy : ", accuracy)

new_observation = np.array([[5.8, 3.0, 4.5, 1.5]])
predicted_class = nb_model.predict(new_observation)
predicted_class = le.inverse_transform(predicted_class)
print("Predicted class : ", predicted_class)

OUTPUT:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0      1            5.1           3.5            1.4           0.2
1      2            4.9           3.0            1.4           0.2
2      3            4.7           3.2            1.3           0.2
3      4            4.6           3.1            1.5           0.2
4      5            5.0           3.6            1.4           0.2
..   ...            ...           ...            ...           ...
145  146            6.7           3.0            5.2           2.3
146  147            6.3           2.5            5.0           1.9
147  148            6.5           3.0            5.2           2.0
148  149            6.2           3.4            5.4           2.3
149  150            5.9           3.0            5.1           1.8

            Species
0       Iris-setosa
1       Iris-setosa
2       Iris-setosa
3       Iris-setosa
4       Iris-setosa
..              ...
145  Iris-virginica
146  Iris-virginica
147  Iris-virginica
148  Iris-virginica
149  Iris-virginica

[150 rows x 6 columns]
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
[[ 1.   5.1  3.5  1.4]
 [ 2.   4.9  3.   1.4]
 [ 3.   4.7  3.2  1.3]
 [ 4.   4.6  3.1  1.5]
 [ 5.   5.   3.6  1.4]
 [ 6.   5.4  3.9  1.7]
 [ 7.   4.6  3.4  1.4]
 [ 8.   5.   3.4  1.5]
 [ 9.   4.4  2.9  1.4]
 [10.   4.9  3.1  1.5]
 [11.   5.4  3.7  1.5]
['Iris-setosa' 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa'
 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-setosa' 'Iris-virginica' 'Iris-virginica'
 'Iris-setosa' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'

Accuracy : 1.0
Accuracy : 1.0
       Real Values  Predicted Values
0      Iris-setosa       Iris-setosa
1   Iris-virginica    Iris-virginica
2  Iris-versicolor   Iris-versicolor
3      Iris-setosa       Iris-setosa
4  Iris-versicolor   Iris-versicolor
5   Iris-virginica    Iris-virginica

Accuracy : 0.9777777777777777
Predicted class :  [1]

DATE:

AIM: Generate frequent itemsets using Apriori Algorithm in Python and also generate association rules for any market basket data.

Description/Theory:
The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative, level-wise search where frequent k-itemsets are used to find candidate (k+1)-itemsets.
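A minimal sketch of one level-wise step on the same toy baskets used below: frequent 1-itemsets are joined into candidate 2-itemsets, which are then pruned by minimum support (the threshold here is illustrative):

from itertools import combinations

transactions = [{'Bread', 'Milk', 'Eggs'},
                {'Bread', 'Diapers', 'Beer', 'Eggs'},
                {'Milk', 'Diapers', 'Beer', 'Cola'},
                {'Bread', 'Milk', 'Diapers', 'Beer', 'Cola', 'Eggs'}]
min_support = 0.5

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {i for t in transactions for i in t}
L1 = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
C2 = {a | b for a, b in combinations(L1, 2)}       # join step: candidates of size 2
L2 = {c for c in C2 if support(c) >= min_support}  # prune step: keep frequent ones
print(sorted(map(sorted, L2)))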

PROCEDURE/CODE:
pip install mlxtend

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

data = [['Bread', 'Milk', 'Eggs'],
        ['Bread', 'Diapers', 'Beer', 'Eggs'],
        ['Milk', 'Diapers', 'Beer', 'Cola'],
        ['Bread', 'Milk', 'Diapers', 'Beer', 'Cola', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
frequent_itemsets = apriori(df, min_support=0.75, use_colnames=True)
print(frequent_itemsets)

rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
print(rules)
selected_columns = ['antecedents', 'consequents', 'antecedent support', 'consequent support', 'support', 'confidence']
print(rules[selected_columns])
Apriori:
pip install apyori

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from apyori import apriori

store_data = pd.read_csv("D:/DM/store_data.csv", header=None)
display(store_data.head())
store_data.shape

records = []
for i in range(1, 7501):
    records.append([str(store_data.values[i, j]) for j in range(0, 20)])
print(type(records))

association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
print("There are {} relations derived.".format(len(association_results)))
for i in range(0, len(association_results)):
    print(association_results[i][0])

for item in association_results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule : " + items[0] + " -> " + items[1])
    print("Support : " + str(item[1]))
    print("Confidence : " + str(item[2][0][2]))
    print("Lift : " + str(item[2][0][3]))
    print("==========================================")

OUTPUT:
   support         itemsets
0     0.75           (Beer)
1     0.75          (Bread)
2     0.75        (Diapers)
3     0.75           (Eggs)
4     0.75           (Milk)
5     0.75  (Beer, Diapers)
6     0.75    (Bread, Eggs)

  antecedents consequents  antecedent support  consequent support  support  \
0      (Beer)   (Diapers)                0.75                0.75     0.75
1   (Diapers)      (Beer)                0.75                0.75     0.75
2     (Bread)      (Eggs)                0.75                0.75     0.75
3      (Eggs)     (Bread)                0.75                0.75     0.75

   confidence      lift  leverage  conviction  zhangs_metric
0         1.0  1.333333    0.1875         inf            1.0
1         1.0  1.333333    0.1875         inf            1.0
2         1.0  1.333333    0.1875         inf            1.0
3         1.0  1.333333    0.1875         inf            1.0

  antecedents consequents  antecedent support  consequent support  support  confidence
0      (Beer)   (Diapers)                0.75                0.75     0.75         1.0
1   (Diapers)      (Beer)                0.75                0.75     0.75         1.0
2     (Bread)      (Eggs)                0.75                0.75     0.75         1.0
3      (Eggs)     (Bread)                0.75                0.75     0.75         1.0
store_data.head():
   0              1          2           3                 4             ...  19
0  shrimp         almonds    avocado     vegetables mix    green grapes  ...  olive oil
1  burgers        meatballs  eggs        NaN               NaN           ...  NaN
2  chutney        NaN        NaN         NaN               NaN           ...  NaN
3  turkey         avocado    NaN        NaN                NaN           ...  NaN
4  mineral water  milk       energy bar  whole wheat rice  green tea     ...  NaN
(7501, 20)

<class 'list'>
There are 48 relations derived.
frozenset({'chicken', 'light cream'})
frozenset({'escalope', 'mushroom cream sauce'})
frozenset({'escalope', 'pasta'})
frozenset({'ground beef', 'herb & pepper'})
frozenset({'tomato sauce', 'ground beef'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'chicken', 'light cream'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'cooking oil', 'ground beef', 'spaghetti'})

Rule : chicken -> light cream
Support : 0.004533333333333334
Confidence : 0.2905982905982906
Lift : 4.843304843304844
==========================================
Rule : escalope -> mushroom cream sauce
Support : 0.005733333333333333
Confidence : 0.30069930069930073
Lift : 3.7903273197390845

DATE:

AIM: Apply K-Means clustering algorithm on any dataset

Description/Theory:
K-Means Clustering is an unsupervised learning algorithm, which groups an unlabeled dataset into different clusters.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for the K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to a particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and each cluster is away from the other clusters.
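A minimal NumPy sketch of one assignment/update iteration on the same seven points used in the procedure below; repeating the two steps until the assignments stop changing gives the full algorithm:

import numpy as np

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
centroids = X[np.random.choice(len(X), 2, replace=False)]  # initial centroids

# assign each point to its nearest centroid
dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dist.argmin(axis=1)

# update each centroid to the mean of its assigned points
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(labels, centroids)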

PROCEDURE/CODE:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
print(X)
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print("\nClusters : ", kmeans.cluster_centers_)
print("\nLabels : ", kmeans.labels_)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap="rainbow")

OUTPUT:
[[1.  1. ]
 [1.5 2. ]
 [3.  4. ]
 [5.  7. ]
 [3.5 5. ]
 [4.5 5. ]
 [3.5 4.5]]

KMeans(n_clusters=2)

Clusters :  [[1.25 1.5 ]
 [3.9  5.1 ]]

Labels :  [0 0 1 1 1 1 1]

[scatter plot of the points coloured by cluster]

DATE:

AIM: Apply Hierarchical Clustering algorithm on any dataset

Description/Theory:
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work: there is no requirement to predetermine the number of clusters as we did in the K-Means algorithm.
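A minimal sketch of the agglomerative idea with SciPy, using the sample coordinates that appear in the output below; each row of the linkage matrix records one merge, and cutting the tree gives flat cluster labels:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0.30, 0.96], [0.31, 0.54], [0.64, 0.88],
                [0.67, 0.15], [0.25, 0.70]])
Z = linkage(pts, method='single')  # each row: [cluster i, cluster j, distance, size]
print(Z)
print(fcluster(Z, t=2, criterion='maxclust'))  # cut into two flat clusters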

PROCEDURE/CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc
from scipy.spatial.distance import squareform, pdist

a = np.random.random_sample(size=5)
b = np.random.random_sample(size=5)
point = ['p1', 'p2', 'p3', 'p4', 'p5']
data = pd.DataFrame({'Point': point, 'a': np.round(a, 2), 'b': np.round(b, 2)})
data = data.set_index('Point')
data

plt.figure(figsize=(8, 5))
plt.xlabel('Column a')
plt.ylabel('Column b')
plt.title('Scatter plot of a and b')
plt.scatter(data['a'], data['b'], c='r', marker='*')
for j in data.itertuples():
    plt.annotate(j.Index, (j.a, j.b), fontsize=15)

dist = pd.DataFrame(squareform(pdist(data[['a', 'b']]), 'euclidean'),
                    columns=data.index.values,
                    index=data.index.values)
dist

DENDROGRAM:
plt.figure(figsize=(12, 5))
plt.title('Dendrogram with Single linkage')
dend = shc.dendrogram(shc.linkage(data[['a', 'b']], method='single'), labels=data.index)

OUTPUT:
          a     b
Point
p1     0.30  0.96
p2     0.31  0.54
p3     0.64  0.88
p4     0.67  0.15
p5     0.25  0.70

<Figure size 800x500 with 0 Axes>
[scatter plot of the five annotated points]

          p1        p2        p3        p4        p5
p1  0.000000  0.420119  0.349285  0.890505  0.264764
p2  0.420119  0.000000  0.473814  0.530754  0.170880
p3  0.349285  0.473814  0.000000  0.730616  0.429535
p4  0.890505  0.530754  0.730616  0.000000  0.692026
p5  0.264764  0.170880  0.429535  0.692026  0.000000

Dendrogram:
[dendrogram with single linkage over p1-p5]
DATE:

AIM: Apply DBSCAN clustering algorithm on any dataset

Description/Theory:
Steps used in the DBSCAN algorithm:
Find all the neighbor points within eps and identify the core points, i.e. points with more than MinPts neighbors.
For each core point, if it is not already assigned to a cluster, create a new cluster.
Find recursively all its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighborhood and both points a and b are within the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density connected to a.
Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
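The first step (counting eps-neighbors to find core points) can be sketched directly with NumPy; the points and parameters below are illustrative:

import numpy as np

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0], [5, 5], [5.1, 5.2], [9, 9]])
eps, min_pts = 0.5, 3

# pairwise distances; a point's neighborhood includes the point itself
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
neighbor_counts = (dist <= eps).sum(axis=1)
core = neighbor_counts >= min_pts
print(core)  # the three tightly packed points come out as core points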

PROCEDURE/CODE:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

centers = [[0.5, 2], [-1, -1], [1.5, -1]]
X, y = make_blobs(n_samples=100, centers=centers, cluster_std=0.5, random_state=0)
print(X, y)

db = DBSCAN(eps=0.4, min_samples=5)
db.fit(X)
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters : %d" % n_clusters_)
y_pred = db.fit_predict(X)

plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='Paired')
plt.title("Clusters determined by DBSCAN")
plt.savefig("DBSCAN.jpg")

K-Means for DBSCAN:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

plt.scatter(X[:, 0], X[:, 1])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print("\nClusters : ", kmeans.cluster_centers_)
print("\nLabels : ", kmeans.labels_)

OUTPUT:
[[ 0.57747371  2.18908126]
 [ 1.1781908  -2.11170158]
 [ 0.53325861  2.15123595]
 [ 0.94780833 -0.97391746]
 [ 1.27223375 -0.99126042]
 [ 1.26638961  2.73467938]
 [ 0.58871307  1.79910953]
 ...]

DBSCAN(eps=0.4)
Estimated number of clusters : 3
[ 0 -1  0  1  1 -1  0  1  2 -1  1  2  2  1 -1 -1 -1  2  1  2  2  0  2 ...]
(remaining labels of the 100 points omitted; -1 marks noise)

[scatter plot "Clusters determined by DBSCAN"]

KMeans(n_clusters=2)

Clusters :  [[ 0.25333922 -0.89314775]
 [ 0.49210963  2.00254955]]

Labels :  [1 0 1 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0
 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 1 0
 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0]
