
DMCT LAB ASSIGNMENT

NAME- Keshav Dutt Gautam


ROLL NO- BTECH/10903/22
BRANCH- CSE
SECTION- C
Assignment 1:
Q1. Explore the WEKA tool for data analysis.
Answer:-
WEKA is an open-source software tool for data mining and machine learning, developed by the University of Waikato. It provides a collection of visualization tools and algorithms for data analysis and predictive modeling, along with a user-friendly graphical interface. WEKA supports tasks such as data preprocessing, classification, regression, clustering, and association rules. It is widely used in academic and research settings due to its ease of use and extensive library of machine learning algorithms.
WEKA – Introduction
The foundation of any machine learning application is data - not just a little data, but the huge volumes termed Big Data in current terminology.
To train a machine to analyze big data, the data itself must satisfy several requirements:
 The data must be clean.
 It should not contain null values.
Besides, not all the columns in the data table are useful for the type of analytics we are trying to achieve. Irrelevant data columns, or 'features' in machine learning terminology, must be removed before the data is fed into a machine learning algorithm. In short, big data needs substantial preprocessing before it can be used for machine learning.
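As a minimal illustration of this kind of cleaning outside of WEKA, here is a sketch assuming a small hypothetical pandas DataFrame with one missing value and an irrelevant 'id' column (both the data and the column names are made up for illustration):

import pandas as pd

# Hypothetical toy data: one missing value and one irrelevant 'id' column
df = pd.DataFrame({'id': [1, 2, 3],
                   'height': [1.70, None, 1.65],
                   'weight': [68, 72, 59]})

print(df.isnull().sum())        # count null values per column
df = df.dropna()                # drop rows with null values (or impute them with fillna)
df = df.drop(columns=['id'])    # remove a feature irrelevant to the analysis
print(df)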
Once the data is ready, we apply various machine learning algorithms such as classification, regression, clustering and so on to solve the problem at hand.
The type of algorithm we apply is based largely on our domain knowledge. Even within the same type, for example classification, there are several algorithms available. We may like to test different algorithms of the same class to build an efficient machine learning model. While doing so, we would prefer to visualize the processed data, and thus we also require visualization tools.
If we follow the WEKA workflow from the beginning, we can see that there are many stages in dealing with Big Data to make it suitable for machine learning:
First, we start with the raw data collected from the field. This data may contain several null values and irrelevant fields. We use the data preprocessing tools provided in WEKA to cleanse the data.
Then, we save the preprocessed data in local storage for applying ML algorithms.
Next, depending on the kind of ML model we are trying to develop, we select one of the options such as Classify, Cluster, or Associate. The Attribute Selection option allows the automatic selection of features to create a reduced dataset.
Then, WEKA gives us the statistical output of the model processing. It also provides a visualization tool to inspect the data. Various models can be applied to the same dataset; we can then compare the outputs of different models and select the best one that meets our purpose.
Thus, the use of WEKA results in quicker development of machine learning models on the whole.
The features and capabilities of the WEKA tool:
1. User Interfaces:
o Explorer
o Experimenter
o KnowledgeFlow
o Simple CLI
2. Data Preprocessing:
o Handling missing values
o Normalizing data
o Discretization
o Attribute selection
3. Classification and Regression:
o Algorithms like Decision Trees, Naive Bayes, SVM, k-NN
4. Clustering:
o Algorithms like k-Means, Expectation Maximization (EM)
5. Association Rule Mining:
o Algorithms like Apriori
6. Attribute Selection
7. Visualization Tools
8. Extensions and Integration:
o Extendable with packages
o Integration with R, Python, Java
9. Support for Various File Formats:
o CSV, ARFF, etc.
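For reference, ARFF is WEKA's native file format; a minimal illustrative file (hypothetical values, loosely modeled on WEKA's sample weather data) looks like this:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes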

 The WEKA GUI Chooser application starts and we would see the following screen -
 When we click on the Explorer button in the Applications selector, it opens the following screen -

Conclusion -
WEKA is a powerful tool for developing machine learning models. It provides implementations of several of the most widely used ML algorithms, and it also allows us to preprocess the data before these algorithms are applied to a dataset. The supported algorithms are grouped under Classify, Cluster, Associate, and Select attributes. The results at various stages of processing can be visualized with a powerful visual representation. This makes it easier for a data scientist to quickly apply various machine learning techniques to a dataset, compare the results, and create the best model for final use.
Assignment 2:
Q1. Given a dataset, reduce the dimension using PCA. Write a program in Python for that purpose.
Answer:-
To reduce the dimensionality of a dataset using Principal Component Analysis (PCA), we follow these steps:
Steps to Perform PCA
1. Standardize the Data:
o Ensure that the data is standardized (mean = 0 and variance = 1). This is crucial because PCA is affected by the scale of the data.
2. Compute the Covariance Matrix:
o Calculate the covariance matrix of the standardized data. The covariance matrix represents the relationships (correlations) between the features in the dataset.
3. Calculate Eigenvalues and Eigenvectors:
o Compute the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the directions of the new feature space (principal components), and the eigenvalues indicate the magnitude (variance) of these components.
4. Sort Eigenvectors by Eigenvalues:
o Rank the eigenvectors by their corresponding eigenvalues in descending order. The eigenvector with the highest eigenvalue is the principal component that explains the most variance in the data.
5. Select Principal Components:
o Choose the top k eigenvectors that correspond to the largest eigenvalues. These eigenvectors form the new, reduced feature space.
6. Transform the Data:
o Project the original data onto the new feature space using the selected eigenvectors. This step reduces the dimensionality of the dataset while retaining as much variance as possible.
7. Analyze the Reduced Dataset:
o The resulting dataset now has reduced dimensions, which can be used for further analysis, visualization, or as input to machine learning models.

Hence, PCA helps in reducing the dimensionality of a dataset by transforming it into a new set of orthogonal features (principal components), while retaining most of the variance present in the original data.
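The steps above can also be traced end-to-end without scikit-learn. The following is a minimal sketch using NumPy only, with a small hypothetical data matrix X standing in for the real dataset (the full program using scikit-learn follows under Python Code below):

import numpy as np

# Hypothetical 5-sample, 3-feature data matrix standing in for the real dataset
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

# Step 1: standardize (mean = 0, variance = 1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: sort in descending order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 5: keep the top k components (k = 2 here)
k = 2
W = eigenvectors[:, :k]

# Step 6: project the data onto the reduced feature space
X_reduced = X_std @ W

# Step 7: inspect how much variance the k components retain
print("Explained variance ratio:", eigenvalues[:k] / eigenvalues.sum())
print(X_reduced)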
Python Code -
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load your dataset
df = pd.read_csv(r'C:\Users\BIT.L1-0052\Desktop\birdsongdataset\test.csv')

# Encode categorical data if present
if 'species' in df.columns:  # adjust the column name as needed
    le = LabelEncoder()
    df['species'] = le.fit_transform(df['species'])

# Convert categorical columns to numerical (one-hot encoding example)
df = pd.get_dummies(df, drop_first=True)

# Save the target column for plotting
if 'species' in df.columns:
    target = df['species']

# Drop non-numeric columns if any remain (optional)
df = df.select_dtypes(include=[float, int])

# Standardize the dataset
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Apply PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

# Create a DataFrame with the PCA result
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])

# Add the target column to the PCA DataFrame for plotting
if 'species' in df.columns:
    pca_df['species'] = target

# Inspect the results
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("PCA Result (First 5 rows):")
print(pca_df.head())

# Plot the PCA results
plt.figure(figsize=(10, 7))
sns.scatterplot(x='PC1', y='PC2', hue='species', data=pca_df, palette='viridis', s=100, edgecolor='w')
plt.title('PCA of Bird Song Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Species')
plt.grid(True)
plt.show()

print(pca_df)

Output:
Assignment 3:
Q1. Explore and examine different types of data visualization tools/libraries using the given dataset.
Python Code -
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Check the first few rows
print(train_data.head())

# Preview both datasets
print("Training Data Head:\n", train_data.head())
print("Testing Data Head:\n", test_data.head())

numeric_cols = train_data.select_dtypes(include=[np.number]).columns
train_numeric = train_data[numeric_cols]
test_numeric = test_data[numeric_cols]

# Standardization
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_numeric)
test_scaled = scaler.transform(test_numeric)

# Covariance matrix
cov_matrix = np.cov(train_scaled, rowvar=False)

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

print("Eigenvalues :\n", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

# Principal Component Analysis (PCA)
pca = PCA(n_components=5)
train_pca = pca.fit_transform(train_scaled)

print("Explained Variance Ratio", pca.explained_variance_ratio_)

pca_df = pd.DataFrame(train_pca, columns=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])

# Box plot
plt.figure(figsize=(10, 5))
plt.boxplot(pca_df)
plt.title('Box Plot of Principal Components')
plt.xlabel('Principal Components')
plt.ylabel('Values')
plt.show()

# Scatter plot
plt.figure(figsize=(10, 5))
plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.5)
plt.title('Scatter Plot of Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

# Violin plot
plt.figure(figsize=(10, 5))
plt.violinplot(pca_df, showmedians=True)
plt.title('Violin Plot of Principal Components')
plt.xlabel('Principal Components')
plt.ylabel('Values')
plt.xticks(ticks=[1, 2, 3, 4, 5], labels=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])
plt.show()

# Line graph of cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(10, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.title('Cumulative Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()

# Heat map of PCA component loadings
pca_components = pd.DataFrame(pca.components_, columns=numeric_cols, index=['PC1', 'PC2', 'PC3', 'PC4', 'PC5'])

plt.figure(figsize=(12, 8))
sns.heatmap(pca_components, annot=True, cmap='coolwarm')
plt.title('Heat Map of PCA')
plt.xlabel('Original features')
plt.ylabel('Principal Components')
plt.show()
Output:
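Beyond the individual plots generated above, seaborn also offers combined views such as pairplot, which draws pairwise scatter plots and per-variable distributions in a single figure. A minimal sketch, assuming the pca_df DataFrame built in the code above:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots and distributions for the first three principal components
sns.pairplot(pca_df[['PC1', 'PC2', 'PC3']])
plt.suptitle('Pair Plot of the First Three Principal Components', y=1.02)
plt.show()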
Assignment 4: Implement the Apriori algorithm using a suitable dataset

!pip install mlxtend

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Load the dataset
df = pd.read_csv('Groceries_dataset.csv')

# Convert the dataset into the required format for Apriori (one-hot encoding).
# Each item in a transaction gets its own column, marked with a binary value (1 or 0).
basket = pd.get_dummies(df.drop('Member_number', axis=1).stack()).groupby(level=0).sum()

# Apply the Apriori algorithm with a minimum support threshold
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)

# Generate association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Display the results
print(frequent_itemsets)
print(rules)

Output-
support itemsets
0 0.064543 (whole milk)
Empty DataFrame
Columns: [antecedents, consequents, antecedent support,
consequent support, support, confidence, lift, leverage,
conviction, zhangs_metric]
Index: []
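The rules table is empty because, at a support threshold of 0.05, only one frequent itemset (whole milk) survives, and association rules require frequent itemsets of at least two items. A possible follow-up sketch, assuming the same basket DataFrame as above, lowers the threshold (0.001 is an illustrative value) and binarizes the counts before mining:

from mlxtend.frequent_patterns import apriori, association_rules

# Convert item counts to booleans, since mlxtend's apriori expects one-hot (0/1 or True/False) data
basket_bool = basket > 0

# A lower support threshold keeps more itemsets, so rules with two or more items become possible
frequent_itemsets = apriori(basket_bool, min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

print(rules.sort_values('lift', ascending=False).head())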
