Kunal DS
Description
The Chi-squared test is a statistical method used in machine learning to check for association between two categorical variables.
Chi-Square distribution
A random variable follows a chi-square (χ²) distribution if it can be written as a sum of squared standard normal variables.
Degrees of freedom
Degrees of freedom refers to the maximum number of logically independent values that are free to vary.
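For example, if Z1, Z2, ..., Zk are independent standard normal variables, then Q = Z1^2 + Z2^2 + ... + Zk^2 follows a chi-square distribution with k degrees of freedom, so k counts the independent squared terms that are free to vary.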
Role/Importance
The Chi-square test assesses how likely it is that an observed distribution is due to chance. It is also called a "goodness of fit" statistic, because it measures how well the observed distribution of the data fits the distribution expected if the variables are independent.
Problem
Solution
First install the R library
Dataset
Program
Output
Dataset –
Program –
Output –
Interpretation
Null hypothesis H0:
Service and Salary are independent
p-value: 0.2796
The p-value is greater than 0.05.
Conclusion
The null hypothesis is not rejected.
Hence there is no significant association between the service provided and the salary.
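For reference, a minimal sketch of how such a test is run in R is given below; the file name employee.csv and the exact column names are assumptions for illustration, not the original dataset used above.
emp <- read.csv('employee.csv')          # hypothetical file with Service and Salary columns
tbl <- table(emp$Service, emp$Salary)    # contingency table of the two categorical variables
test <- chisq.test(tbl)                  # Pearson's chi-squared test of independence
test$p.value                             # compare with 0.05, as in the interpretation above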
Dataset –
Program –
Output –
Interpretation
Conclusion -
As age increases, the glucose count increases.
Practical No: 2
Title: PCA
Roll No: 4808
1. Import dataset in R
diabet<-read.csv('C:/kunal_ganjale_4808_Ty/DS/code/heart.csv') # read the dataset into diabet (used below)
Data set
The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Data Description
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Blood Pressure: Diastolic blood pressure (mm Hg)
Skin Thickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
Diabetes Pedigree Function: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1); 1 indicates diabetes is present
attach(diabet)            # make the column names directly accessible
diabet
class(BMI)                # data type of the BMI column
table(Outcome)            # frequency of each Outcome class
count<-table(Outcome)
barplot(count,col=2)      # bar chart of the Outcome counts
pie(count)                # the same counts as a pie chart
table(Pregnancies)        # frequency of each number of pregnancies
count<-table(Pregnancies)
barplot(count)
pie(count)
Output:
Conclusion:
Observations on Pregnancies, Diabetics, Non-diabetics, and Glucose in non-diabetics (from the plots above)
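Since this practical is titled PCA, a minimal sketch of running principal component analysis on the numeric predictors is given below; it assumes the diabetes data frame is available as diabet, as in the code above, and is an illustration rather than the graded program.
diabet_num <- diabet[ , setdiff(names(diabet), "Outcome")]   # drop the target column
pca <- prcomp(diabet_num, center = TRUE, scale. = TRUE)      # standardise, then compute principal components
summary(pca)                                                 # proportion of variance explained by each component
biplot(pca)                                                  # plot the data on the first two principal components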
attach(newdata1)              # make the column names directly accessible
class(cylinders)              # data type of the cylinders column
table(newdata1$modelyear)     # frequency of each model year
count<-table(newdata1$modelyear)
barplot(count,col=2)
pie(count)
table(newdata1$origin)        # frequency of each origin
count<-table(newdata1$origin)
barplot(count)
pie(count)
Output:
Boxplots of acceleration, mpg, and weight
Conclusion:
Acceleration shows a symmetric data distribution.
Cars with acceleration in the range 14-16 give the maximum mileage.
heart<-read.csv('D:/college/sem_6/data science/code/heart.csv')
summary(heart)
names(heart)
library(partykit)
heart$target<-as.factor(heart$target)#convert to categorical
summary(heart$target)
names(heart)
set.seed(1234)
pd<-sample(2,nrow(heart),replace = TRUE, prob=c(0.8,0.2)) # two samples with distribution 0.8 and 0.2
trainingset<-heart[pd==1,]   # first partition (80%)
validationset<-heart[pd==2,] # second partition (20%)
tree<-ctree(formula = target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data=trainingset)
class(heart$target)
plot(tree)
# Pruning
tree<-ctree(formula = target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data=trainingset, control=ctree_control(mincriterion = 0.99, minsplit = 500))
plot(tree)
pred<-predict(tree,validationset,type="prob")  # class probabilities for the validation set
pred
pred<-predict(tree,validationset)              # predicted classes for the validation set
pred
library(caret)
confusionMatrix(pred,validationset$target)
pred<-predict(tree,validationset,type="prob")
pred
library(pROC)
plot(roc(validationset$target,pred[ ,2]))
library(rpart)
fit <- rpart(target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + exang + oldpeak + slope + ca + thal, data=trainingset, method="class")
plot(fit)
text(fit)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(fit)
Prediction <- predict(fit, validationset, type = "class")
Prediction
Output:
Conclusion:
Accuracy: 73.36%
Green nodes in the tree indicate no heart disease (false).
Blue nodes in the tree indicate heart disease (true).
Darker shades indicate a higher proportion of true/false for heart disease at that node.
The ROC curve shows that the model is not very accurate, as sensitivity and specificity are nearly equal. The true positive rate is not high enough, so the accuracy is only moderate. For good accuracy both specificity and sensitivity should exceed 80%, but here only sensitivity is above 80%.
partykit: A Toolkit for Recursive Partitioning
A toolkit with infrastructure for representing, summarizing, and
visualizing tree-structured regression and classification models.
This unified infrastructure can be used for reading/coercing tree models from different sources ('rpart', 'RWeka', 'PMML'), yielding objects that share functionality for print()/plot()/predict() methods.
Caret:
The caret package (short for Classification and Regression
Training) contains functions to streamline the model training
process for complex regression and classification problems.
pROC
pROC is a set of tools to visualize, smooth, and compare receiver operating characteristic (ROC) curves. The (partial) area under the curve (AUC) can be compared with statistical tests based on U-statistics or the bootstrap. Confidence intervals can be computed for (p)AUC or ROC curves.
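As a small illustration of how the AUC itself can be reported for the tree above, the sketch below reuses the ROC object built from the validation predictions in this practical.
roc_obj <- roc(validationset$target, pred[ ,2])   # same ROC object as plotted above
auc(roc_obj)                                      # area under the ROC curve
ci.auc(roc_obj)                                   # confidence interval for the AUC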
Rattle
An R package providing a graphical user interface to many other R packages that provide data mining functionality.
data.table
data.table is an extension of R's data.frame. It is widely used for fast aggregation of large datasets, low-latency add/update/remove of columns, quicker ordered joins, and a fast file reader.
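data.table is not used in this practical's graded code; the sketch below only illustrates the fast file reader and grouped aggregation described above, with the file and column names borrowed from the heart dataset used earlier as an assumption.
library(data.table)
dt <- fread("heart.csv")                         # fast file reader
dt[ , .(mean_chol = mean(chol)), by = target]    # mean cholesterol per target class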
rpart.plot
This function combines and extends plot.rpart and text.rpart in the rpart package. It automatically scales and adjusts the displayed tree for the best fit.
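For comparison with fancyRpartPlot above, the package's own plotting function can be called directly on the fitted model; this is a sketch using the fit object from this practical.
library(rpart.plot)
rpart.plot(fit, type = 2, extra = 104)   # draw the tree with class probabilities and observation percentages at each node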
Practical No: 5
Title: Clustering
Roll No: 4808
1. Import dataset
data<-read.csv('C:/kunal_Ganjale_TY_4808/DS/code/coordinate.csv')
data<-data[1:150,]   # keep the first 150 rows
names(data)
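The clustercut vector used in the plot below comes from the clustering step; a minimal sketch of that step (the elbow method followed by a k-means fit on the two coordinate columns, with the column names X and y assumed from the plotting call) is:
# Elbow method: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(data[ , c("X","y")], centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# Fit k-means with the chosen number of clusters (2, as per the conclusion below)
km <- kmeans(data[ , c("X","y")], centers = 2, nstart = 25)
clustercut <- km$cluster   # cluster labels used to colour the points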
library(ggplot2)
ggplot(data, aes(X, y)) +
  geom_point(alpha = 0.4, size = 3.5) +
  geom_point(col = clustercut) +
  scale_color_manual(values = c("red", "green", "black", "blue"))
12. DBSCAN clustering
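The DBSCAN code itself is not reproduced here; a minimal sketch using the dbscan package is shown below, with eps and minPts values that are illustrative rather than tuned for this data.
library(dbscan)
db <- dbscan(data[ , c("X","y")], eps = 0.5, minPts = 5)   # density-based clustering
db$cluster                                                 # cluster labels; 0 marks noise points
plot(data$X, data$y, col = db$cluster + 1, pch = 19)       # colour the points by DBSCAN cluster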
Conclusion:
As the box plots of both the x and y features show, no outliers are present, so no extra feature construction is needed.
Both the elbow method and the silhouette method give 2 as the optimal number of clusters, so we build two clusters.
We construct clusters using the k-means, hierarchical, and DBSCAN methods, but the k-means clusters give a better representation of the data than the other two.
It is found that the clusters are formed using the x feature as the primary aspect. As can be seen in the figure below, the second cluster forms as the x value increases; both clusters contain high y values, but the first cluster has lower x values than the second, so in this case the x feature is the primary aspect for clustering.
Practical No: 6
Title: Association
Roll No: 4808
Association:
Association is a data mining technique that discovers the
probability of the co-occurrence of items in a collection. The
relationships between co-occurring items are expressed as
Association Rules. Association rule mining finds interesting
associations and relationships among large sets of data items.
Association rules are "if-then" statements that help to show the probability of relationships between data items within large data sets in various types of databases. Here the "if" element is called the antecedent, and the "then" part is called the consequent. Relationships in which we can find an association between two items are known as single cardinality.
Association rule mining has a number of applications and is
widely used to help discover sales correlations in transactional
data or in medical data sets.
Apriori:
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which k-frequent itemsets are used to find (k+1)-itemsets.
To improve the efficiency of level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.
Apriori Property – all non-empty subsets of a frequent itemset must be frequent.
Limitations of Apriori Algorithm
The Apriori algorithm can be slow.
Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, low minimum support, or large itemsets; it is not an efficient approach for a large number of datasets. It checks many sets from the candidate itemsets and scans the database repeatedly to find them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
Algorithm
1. Calculate the support of itemsets (of size k = 1) in the transactional database (support is the frequency of occurrence of an itemset). This is called generating the candidate set.
2. Prune the candidate set by eliminating items with a support less than the given threshold.
3. Join the frequent itemsets to form sets of size k + 1, and repeat the above steps until no more itemsets can be formed. This happens when the set(s) formed have a support less than the given threshold.
OR
1. Set a minimum support and confidence.
2. Take all the subset present in the transactions which have
higher support than minimum support.
3. Take all the rules of these subsets which have higher
confidence than minimum confidence.
4. Sort the rules by decreasing lift.
Components of Apriori
Support and Confidence:
Support refers to an item's frequency of occurrence, i.e. how often items x and y are purchased together; confidence is the conditional probability that item y is purchased given that item x is purchased.
Support( I )=( Number of transactions containing item I ) / ( Total
number of transactions )
Confidence( I1 -> I2 ) =( Number of transactions containing I1 and
I2 ) / ( Number of transactions containing I1 )
Lift:
Lift gives the correlation between A and B in the rule A => B. Correlation shows how one itemset A affects the itemset B.
If the rule has a lift of 1, then A and B are independent and no rule can be derived from them.
If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
If the lift is < 1, then the presence of A has a negative effect on B.
Lift( I1 -> I2 ) = Confidence( I1 -> I2 ) / Support( I2 )
Coverage:
Coverage (also called cover or LHS-support) is the support of the left-hand side of the rule X => Y, i.e., supp(X).
It is a measure of how often the rule can be applied.
Coverage can be quickly calculated from the rule's quality measures as support divided by confidence.
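For illustration with hypothetical numbers: suppose there are 100 transactions, 20 contain bread, 25 contain butter, and 10 contain both. Then Support(bread) = 20/100 = 0.20, Confidence(bread -> butter) = 10/20 = 0.50, Lift(bread -> butter) = 0.50 / 0.25 = 2 (the rule is stronger than chance), and Coverage(bread -> butter) = Support(bread) = 0.20.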
FP-tree:
The FP-Growth algorithm, proposed by Han, is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. For this it uses a divide-and-conquer strategy. The core of the method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information. This tree-like structure is built from the initial itemsets of the database. The purpose of the FP-tree is to mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Algorithm:
Building the tree
1. Import libraries
library(arules)
library(arulesViz)
library(RColorBrewer)
2. Import dataset
data<-read.transactions('D:/college/sem_6/data science/code/supermarket.csv', rm.duplicates = TRUE, format = "single", sep = ",", header = TRUE, cols = c("Branch","Product line"))
#data<-read.csv('C:/kunal_Ganjale_TY_4808/DS/code/Super Store.csv')
#data <- subset(data, select = c(0,1))
5. Labels of items
data@itemInfo$labels
6. Generating rules
data_rules <- apriori(data, parameter = list(supp = 0.01, conf = 0.2))
data_rules
7. Inspect rules
inspect(data_rules[1:20])                                 # first 20 rules
inspect(head(sort(data_rules, by = "confidence"), 10))    # top 10 rules by confidence
Conclusion:
As the trend of the y parameter (the blue line) shows, the model predicts values that follow the trend.
Practical No: 8
Title: MongoDB
Roll No: 4808
Display details of students whose age is greater than 22
db.student.find({age:{$gt:22}}).pretty()
Display details of students whose city is Pune
db.student.find({'address.city':'Pune'}).pretty()
Display students who got more than 84 marks in physics
db.student_mark.find({'marks.physics':{$gt:84}}).pretty()
Display students whose city is Pune or Mumbai and whose age is at least 21
db.student.find({'address.city':{$in:["Pune","mumbai"]},age:{$gte:21}}).pretty()
Display students who got 70 or more marks in all subjects
db.student_mark.find({'marks.bio':{$gte:70},'marks.physics':{$gte:70},'marks.chem':{$gte:70}}).pretty()
Delete collection
db.student_mark.drop()
Drop database
db.dropDatabase()
Practical No: 9
Title: Topic Modelling
Roll No: 4808
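The dtm and k objects used in the LDA call below were built in steps that are not reproduced here; a minimal sketch of that preprocessing with the tm and topicmodels packages is given below, where the folder name and the number of topics are assumptions.
library(tm)
library(topicmodels)
docs <- Corpus(DirSource("texts"))                        # hypothetical folder of text files
docs <- tm_map(docs, content_transformer(tolower))        # lower-case the text
docs <- tm_map(docs, removePunctuation)                   # strip punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))   # drop common stop words
dtm <- DocumentTermMatrix(docs)                           # document-term matrix used by LDA
k <- 3                                                    # assumed number of topics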
#lda_output_3<-LDA(dtm,k,method="VEM",control=control_VEM)   # alternative call with explicit control settings
lda_output_3<-LDA(dtm,k,method="VEM")   # fit the LDA model with k topics using variational EM
#lda_output_3@Dim
#show(dtm)
topics(lda_output_3)      # most likely topic for each document
terms(lda_output_3,10)    # top 10 terms for each topic
Output:
Conclusion:
The keywords used in all the text files are well suited for natural language processing (NLP).