Kunal DS
Description
The Chi-squared test is a statistical method used in machine learning to check for association between two categorical variables.
Chi-Square distribution
A random variable follows a chi-square (χ²) distribution if it can be written as a sum of squared standard normal variables.
Degrees of freedom
Degrees of freedom refers to the maximum number of logically independent values that are free to vary.
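For example, if Z1, Z2, ..., Zk are independent standard normal variables, then Q = Z1^2 + Z2^2 + ... + Zk^2 follows a chi-square distribution with k degrees of freedom, so k counts the independent squared terms that are free to vary.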
Role/Importance
The Chi-square test assesses how likely it is that an observed distribution is due to chance. It is also called a "goodness of fit" statistic, because it measures how well the observed distribution of the data fits the distribution expected if the variables are independent.
Problem
Solution
First install the R library
Dataset
Program
Output
Dataset –
Program –
Output –
Interpretation
Null hypothesis H0:
Service and Salary are independent
p-value: 0.2796
The p-value is greater than 0.05.
Conclusion
The null hypothesis is not rejected.
Hence there is no significant association between the service provided and the salary.
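For reference, a minimal sketch of how such a test is run in R is given below; the file name employee.csv and the exact column names are assumptions for illustration, not the original dataset used above.
emp <- read.csv('employee.csv')          # hypothetical file with Service and Salary columns
tbl <- table(emp$Service, emp$Salary)    # contingency table of the two categorical variables
test <- chisq.test(tbl)                  # Pearson's chi-squared test of independence
test$p.value                             # compare with 0.05, as in the interpretation above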
Dataset –
Program –
Output –
Interpretation
Conclusion -
As age increases, the glucose count increases.
Practical No: 2
Title: PCA
Roll No: 4808
1. Import dataset in R
diabet<-read.csv('C:/kunal_ganjale_4808_Ty/DS/code/heart.csv') # read the dataset into diabet (used below)
Data set
The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Data Description
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Blood Pressure: Diastolic blood pressure (mm Hg)
Skin Thickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
Diabetes Pedigree Function: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1); 1 indicates diabetes is present
attach(diabet)            # make the column names directly accessible
diabet
class(BMI)                # data type of the BMI column
table(Outcome)            # frequency of each Outcome class
count<-table(Outcome)
barplot(count,col=2)      # bar chart of the Outcome counts
pie(count)                # the same counts as a pie chart
table(Pregnancies)        # frequency of each number of pregnancies
count<-table(Pregnancies)
barplot(count)
pie(count)
Output:
Conclusion:
Observations on Pregnancies, Diabetics, Non-diabetics, and Glucose in non-diabetics (from the plots above)
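Since this practical is titled PCA, a minimal sketch of running principal component analysis on the numeric predictors is given below; it assumes the diabetes data frame is available as diabet, as in the code above, and is an illustration rather than the graded program.
diabet_num <- diabet[ , setdiff(names(diabet), "Outcome")]   # drop the target column
pca <- prcomp(diabet_num, center = TRUE, scale. = TRUE)      # standardise, then compute principal components
summary(pca)                                                 # proportion of variance explained by each component
biplot(pca)                                                  # plot the data on the first two principal components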
attach(newdata1)              # make the column names directly accessible
class(cylinders)              # data type of the cylinders column
table(newdata1$modelyear)     # frequency of each model year
count<-table(newdata1$modelyear)
barplot(count,col=2)
pie(count)
table(newdata1$origin)        # frequency of each origin
count<-table(newdata1$origin)
barplot(count)
pie(count)
Output:
Boxplots of acceleration, mpg, and weight
Conclusion:
Acceleration shows a symmetric data distribution.
Cars with acceleration in the range 14-16 give the maximum mileage.
heart<-read.csv('D:/college/sem_6/data science/code/heart.csv')
summary(heart)
names(heart)
library(partykit)
heart$target<-as.factor(heart$target)#convert to categorical
summary(heart$target)
names(heart)
set.seed(1234)
pd<-sample(2,nrow(heart),replace = TRUE, prob=c(0.8,0.2)) # two samples with distribution 0.8 and 0.2
trainingset<-heart[pd==1,]   # first partition (80%)
validationset<-heart[pd==2,] # second partition (20%)
tree<-ctree(formula = target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data=trainingset)
class(heart$target)
plot(tree)
# Pruning
tree<-ctree(formula = target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data=trainingset, control=ctree_control(mincriterion = 0.99, minsplit = 500))
plot(tree)
pred<-predict(tree,validationset,type="prob")  # class probabilities for the validation set
pred
pred<-predict(tree,validationset)              # predicted classes for the validation set
pred
library(caret)
confusionMatrix(pred,validationset$target)
pred<-predict(tree,validationset,type="prob")
pred
library(pROC)
plot(roc(validationset$target,pred[ ,2]))
library(rpart)
fit <- rpart(target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data=trainingset, method="class")
plot(fit)
text(fit)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(fit)
Prediction <- predict(fit, validationset, type = "class")
Prediction
Output:
Conclusion:
Accuracy: 73.36%
Green nodes in the tree indicate no heart disease (false).
Blue nodes in the tree indicate heart disease (true).
Darker shades indicate a higher proportion of true/false for heart disease at that node.
The ROC curve shows that the model is not very accurate, as sensitivity and specificity are nearly equal. The true positive rate is not high enough, so the accuracy is only moderate. For good accuracy both specificity and sensitivity should exceed 80%, but here only sensitivity is above 80%.
partykit: A Toolkit for Recursive Partitioning
A toolkit with infrastructure for representing, summarizing, and
visualizing tree-structured regression and classification models.
This unified infrastructure can be used for reading/coercing tree models from different sources ('rpart', 'RWeka', 'PMML'), yielding objects that share functionality for print()/plot()/predict() methods.
Caret:
The caret package (short for Classification and Regression
Training) contains functions to streamline the model training
process for complex regression and classification problems.
pROC
pROC is a set of tools to visualize, smooth, and compare receiver operating characteristic (ROC) curves. The (partial) area under the curve (AUC) can be compared with statistical tests based on U-statistics or the bootstrap. Confidence intervals can be computed for (p)AUC or ROC curves.
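As a small illustration of how the AUC itself can be reported for the tree above, the sketch below reuses the ROC object built from the validation predictions in this practical.
roc_obj <- roc(validationset$target, pred[ ,2])   # same ROC object as plotted above
auc(roc_obj)                                      # area under the ROC curve
ci.auc(roc_obj)                                   # confidence interval for the AUC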
Rattle
An R package providing a graphical user interface to many other R packages that provide data mining functionality.
data.table
data.table is an extension of R's data.frame. It is widely used for fast aggregation of large datasets, low-latency add/update/remove of columns, quicker ordered joins, and a fast file reader.
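data.table is not used in this practical's graded code; the sketch below only illustrates the fast file reader and grouped aggregation described above, with the file and column names borrowed from the heart dataset used earlier as an assumption.
library(data.table)
dt <- fread("heart.csv")                         # fast file reader
dt[ , .(mean_chol = mean(chol)), by = target]    # mean cholesterol per target class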
rpart.plot
This function combines and extends plot.rpart and text.rpart in the rpart package. It automatically scales and adjusts the displayed tree for the best fit.
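For comparison with fancyRpartPlot above, the package's own plotting function can be called directly on the fitted model; this is a sketch using the fit object from this practical.
library(rpart.plot)
rpart.plot(fit, type = 2, extra = 104)   # draw the tree with class probabilities and observation percentages at each node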
Practical No: 5
Title: Clustering
Roll No: 4808
1. Import dataset
data<-read.csv('C:/kunal_Ganjale_TY_4808/DS/code/coordinate.csv')
data<-data[1:150,]   # keep the first 150 rows
names(data)
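The clustercut vector used in the plot below comes from the clustering step; a minimal sketch of that step (the elbow method followed by a k-means fit on the two coordinate columns, with the column names X and y assumed from the plotting call) is:
# Elbow method: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(data[ , c("X","y")], centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# Fit k-means with the chosen number of clusters (2, as per the conclusion below)
km <- kmeans(data[ , c("X","y")], centers = 2, nstart = 25)
clustercut <- km$cluster   # cluster labels used to colour the points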
library(ggplot2)
ggplot(data, aes(X, y)) +
  geom_point(alpha = 0.4, size = 3.5) +
  geom_point(col = clustercut) +
  scale_color_manual(values = c("red", "green", "black", "blue"))
12. DBSCAN clustering
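The DBSCAN code itself is not reproduced here; a minimal sketch using the dbscan package is shown below, with eps and minPts values that are illustrative rather than tuned for this data.
library(dbscan)
db <- dbscan(data[ , c("X","y")], eps = 0.5, minPts = 5)   # density-based clustering
db$cluster                                                 # cluster labels; 0 marks noise points
plot(data$X, data$y, col = db$cluster + 1, pch = 19)       # colour the points by DBSCAN cluster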
Conclusion:
As the box plots of both the x and y features show, no outliers are present, so no extra feature construction is needed.
Both the elbow method and the silhouette method give 2 as the optimal number of clusters, so we build two clusters.
We construct clusters using the k-means, hierarchical, and DBSCAN methods, but the k-means clusters give a better representation of the data than the other two.
It is found that the clusters are formed using the x feature as the primary aspect. As can be seen in the figure below, the second cluster forms as the x value increases; both clusters contain high y values, but the first cluster has lower x values than the second, so in this case the x feature is the primary aspect for clustering.
Practical No: 6
Title: Association
Roll No: 4808
Association:
Association is a data mining technique that discovers the
probability of the co-occurrence of items in a collection. The
relationships between co-occurring items are expressed as
Association Rules. Association rule mining finds interesting
associations and relationships among large sets of data items.
Association rules are "if-then" statements that help to show the probability of relationships between data items within large data sets in various types of databases. Here the "if" element is called the antecedent, and the "then" part is called the consequent. Relationships in which we can find an association between two items are known as single cardinality.
Association rule mining has a number of applications and is
widely used to help discover sales correlations in transactional
data or in medical data sets.
Apriori:
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which k-frequent itemsets are used to find (k+1)-itemsets.
To improve the efficiency of level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.
Apriori Property – all non-empty subsets of a frequent itemset must be frequent.
Limitations of Apriori Algorithm
The Apriori algorithm can be slow.
Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, low minimum support, or large itemsets; it is not an efficient approach for a large number of datasets. It checks many sets from the candidate itemsets and scans the database repeatedly to find them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
Algorithm
1. Calculate the support of itemsets (of size k = 1) in the transactional database (support is the frequency of occurrence of an itemset). This is called generating the candidate set.
2. Prune the candidate set by eliminating items with a support less than the given threshold.
3. Join the frequent itemsets to form sets of size k + 1, and repeat the above steps until no more itemsets can be formed. This happens when the set(s) formed have a support less than the given threshold.
OR
1. Set a minimum support and confidence.
2. Take all the subset present in the transactions which have
higher support than minimum support.
3. Take all the rules of these subsets which have higher
confidence than minimum confidence.
4. Sort the rules by decreasing lift.
Components of Apriori
Support and Confidence:
Support refers to an item's frequency of occurrence, i.e. how often items x and y are purchased together; confidence is the conditional probability that item y is purchased given that item x is purchased.
Support( I )=( Number of transactions containing item I ) / ( Total
number of transactions )
Confidence( I1 -> I2 ) =( Number of transactions containing I1 and
I2 ) / ( Number of transactions containing I1 )
Lift:
Lift gives the correlation between A and B in the rule A => B. Correlation shows how one itemset A affects the itemset B.
If the rule has a lift of 1, then A and B are independent and no rule can be derived from them.
If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
If the lift is < 1, then the presence of A has a negative effect on B.
Lift( I1 -> I2 ) = Confidence( I1 -> I2 ) / Support( I2 )
Coverage:
Coverage (also called cover or LHS-support) is the support of the left-hand side of the rule X => Y, i.e., supp(X).
It is a measure of how often the rule can be applied.
Coverage can be quickly calculated from the rule's quality measures as support divided by confidence.
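For illustration with hypothetical numbers: suppose there are 100 transactions, 20 contain bread, 25 contain butter, and 10 contain both. Then Support(bread) = 20/100 = 0.20, Confidence(bread -> butter) = 10/20 = 0.50, Lift(bread -> butter) = 0.50 / 0.25 = 2 (the rule is stronger than chance), and Coverage(bread -> butter) = Support(bread) = 0.20.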
FP-tree:
The FP-Growth algorithm, proposed by Han, is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. For this it uses a divide-and-conquer strategy. The core of the method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information. This tree-like structure is built from the initial itemsets of the database. The purpose of the FP-tree is to mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Algorithm:
Building the tree
1. Import libraries
library(arules)
library(arulesViz)
library(RColorBrewer)
2. Import dataset
data<-read.transactions('D:/college/sem_6/data science/code/supermarket.csv', rm.duplicates = TRUE, format = "single", sep = ",", header = TRUE, cols = c("Branch","Product line"))
#data<-read.csv('C:/kunal_Ganjale_TY_4808/DS/code/Super Store.csv')
#data <- subset(data, select = c(0,1))
5. Labels of items
data@itemInfo$labels
6. Generating rules
data_rules <- apriori(data, parameter = list(supp = 0.01, conf = 0.2))
data_rules
7. Inspect rules
inspect(data_rules[1:20])                                 # first 20 rules
inspect(head(sort(data_rules, by = "confidence"), 10))    # top 10 rules by confidence
Conclusion:
As the trend of the y parameter (the blue line) shows, the model predicts values that follow the trend.
Practical No: 8
Title: MongoDB
Roll No: 4808
Display details of students whose age is greater than 22
db.student.find({age:{$gt:22}}).pretty()
Display details of students whose city is Pune
db.student.find({'address.city':'Pune'}).pretty()
Display students who got more than 84 marks in physics
db.student_mark.find({'marks.physics':{$gt:84}}).pretty()
Display students whose city is Pune or Mumbai and whose age is at least 21
db.student.find({'address.city':{$in:["Pune","mumbai"]},age:{$gte:21}}).pretty()
Display students who got 70 or more marks in all subjects
db.student_mark.find({'marks.bio':{$gte:70},'marks.physics':{$gte:70},'marks.chem':{$gte:70}}).pretty()
Delete collection
db.student_mark.drop()
Drop database
db.dropDatabase()
Practical No: 9
Title: Topic Modelling
Roll No: 4808
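The dtm and k objects used in the LDA call below were built in steps that are not reproduced here; a minimal sketch of that preprocessing with the tm and topicmodels packages is given below, where the folder name and the number of topics are assumptions.
library(tm)
library(topicmodels)
docs <- Corpus(DirSource("texts"))                        # hypothetical folder of text files
docs <- tm_map(docs, content_transformer(tolower))        # lower-case the text
docs <- tm_map(docs, removePunctuation)                   # strip punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))   # drop common stop words
dtm <- DocumentTermMatrix(docs)                           # document-term matrix used by LDA
k <- 3                                                    # assumed number of topics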
#lda_output_3<-LDA(dtm,k,method="VEM",control=control_VEM)   # alternative call with explicit control settings
lda_output_3<-LDA(dtm,k,method="VEM")   # fit the LDA model with k topics using variational EM
#lda_output_3@Dim
#show(dtm)
topics(lda_output_3)      # most likely topic for each document
terms(lda_output_3,10)    # top 10 terms for each topic
Output:
Conclusion:
The keywords used in all the text files are well suited for natural language processing (NLP).