
Practical No: 1

Title: Data pre-processing - chi-square & correlation
Roll No: 4808

Description
The chi-squared test is a statistical method used in machine learning to check the association between two categorical variables.

Chi-Square distribution
A random variable follows the chi-square distribution if it can be written as a sum of squares of independent standard normal variables: χ² = Z₁² + Z₂² + … + Z_k², where each Zᵢ ~ N(0, 1) and k is the number of degrees of freedom.

Degrees of freedom
Degrees of freedom refers to the maximum number of logically independent values that are free to vary. For a contingency table with r rows and c columns, the chi-square test of independence has (r − 1) × (c − 1) degrees of freedom.

A chi-square test is used in statistics to test the independence of two events. Given the data for two variables, we can obtain the observed count O and the expected count E. The chi-square statistic measures how much the observed count O deviates from the expected count E: χ² = Σ (O − E)² / E, summed over all cells of the contingency table.
Steps to perform the Chi-Square Test
● Define Hypothesis.
● Build a Contingency table.
● Find the expected values.
● Calculate the Chi-Square statistic.
● Accept or Reject the Null Hypothesis.
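
A minimal R sketch of these steps (the file name and the categorical column names Service and Salary are assumptions, based on the interpretation later in this practical):

# a minimal sketch, assuming a CSV with categorical columns Service and Salary (names assumed)
emp <- read.csv('employee.csv')            # hypothetical file name
tbl <- table(emp$Service, emp$Salary)      # contingency table of observed counts O
test <- chisq.test(tbl)                    # computes expected counts E and the chi-square statistic
test$expected                              # expected values E
test$statistic                             # chi-square statistic
test$p.value                               # reject H0 if the p-value is below 0.05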

Role/Importance
The chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also called a "goodness of fit" statistic, because it measures how well the observed distribution of data fits the distribution that is expected if the variables are independent.

A chi-square test is designed to analyze categorical data. That means the data has been counted and divided into categories.

5% level of significance – the probability of rejecting the null hypothesis when it is true.

Problem

Solution
First install the R library

Dataset –

Program –
Output –

Dataset –

Program –

Output -
Interpretation
Null hypothesis H0:
Service and Salary are independent

Alternative Hypothesis H1:
Service and Salary are dependent

p-value: 0.2796
The p-value is greater than 0.05.

Conclusion
The null hypothesis is accepted (we fail to reject H0). Hence there is no significant relationship between service and salary.

Data Preparation - Correlation coefficient


Description

A correlation test is used to evaluate the association between two or more variables.

The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.
The Pearson product-moment correlation coefficient is a measure of the strength and direction of the linear relationship between two variables, defined as the covariance of the variables divided by the product of their standard deviations: r = cov(X, Y) / (σX · σY). This is the best-known and most commonly used type of correlation coefficient.

Strength: The greater the absolute value of the correlation coefficient, the stronger the relationship.

Direction: The sign of the correlation coefficient represents the direction of the relationship.

The values range between -1.0 and 1.0.

A calculated number greater than 1.0 or less than -1.0 means that there was an error in the correlation measurement.
A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation.
A correlation of 0.0 shows no relationship between the movement of the two variables.
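
A minimal R sketch (the file name and the numeric column names Age and Glucose are assumptions, based on the conclusion below):

# a minimal sketch, assuming a CSV with numeric columns Age and Glucose (names assumed)
data <- read.csv('glucose.csv')                         # hypothetical file name
cor(data$Age, data$Glucose)                             # Pearson correlation coefficient
cor.test(data$Age, data$Glucose, method = "pearson")    # adds a p-value and confidence interval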

First install the R library

Dataset –

Program –

Output –
Interpretation

Conclusion –
As age increases, the glucose count increases.

Practical No: 2
Title: PCA
Roll No: 4808
1. Import dataset in R
data <- read.csv('C:/kunal_ganjale_4808_Ty/DS/code/heart.csv')

2. Print values of dataset


head(data)

3. Calculate principal component analysis


require(stats)
pc <- prcomp(x = data[-1],
center = TRUE,
scale. = FALSE)
print(pc)
summary(pc)
4. Plot the principal components
require(ggbiplot)
ggbiplot(pc)
Conclusion:
More variation is observed in PC1 than in PC2. We can give more importance to target, thalach, slope and cp because they show more variance and less covariance.
Practical No:3
Title: Exploratory data analysis
Roll No:4808

Exploratory Data Analysis

● Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
● At a high level, EDA is the practice of describing the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis.
● This involves looking at your data set from many angles, describing it, and summarizing it without making any assumptions about its contents.
● This is a significant step to take before diving into machine learning or statistical modeling, to make sure the data are really what they are claimed to be and that there are no obvious problems.

Data set
The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Data Description
● Pregnancies: Number of times pregnant
● Glucose: Plasma glucose concentration after 2 hours in an oral glucose tolerance test
● Blood Pressure: Diastolic blood pressure (mm Hg)
● Skin Thickness: Triceps skin fold thickness (mm)
● Insulin: 2-hour serum insulin (mu U/ml)
● BMI: Body mass index (weight in kg / (height in m)^2)
● Diabetes Pedigree Function: Diabetes pedigree function
● Age: Age (years)
● Outcome: Class variable (0 or 1); 1 indicates diabetes is present

Case1: diabetes patient analysis


Code:
diabet <- read.csv('C:/kunal_ganjale_4808_Ty/DS/code/diabetes.csv')
head(diabet)
str(diabet)
summary(diabet)
diabet[1:10,]
diabet[,1:2]
diabet[1:10,1:2]
newdata1<-subset(diabet,diabet$Outcome=="1")
newdata1
newdata2<-subset(diabet,diabet$Pregnancies=="1"
&diabet$Outcome=="1")
newdata2
newdata3<-subset(diabet,diabet$Pregnancies=="1" |
diabet$Outcome=="0",select=c(1,2))
newdata3
newdata4<-diabet[order(diabet$BMI), ]
newdata4
newdata5<-diabet[order(-diabet$BMI),]
newdata5
newdata6<-aggregate(BMI~Outcome,data=diabet,FUN=mean)
newdata6
names(diabet)
colSums(is.na(diabet))
hist(diabet$BMI,col='RED')
plot(diabet$BMI)
boxplot(diabet$BMI)
mean(diabet$BMI)
median(diabet$BMI)
max(diabet$BMI)
min(diabet$BMI)
boxplot(newdata2$SkinThickness)

attach(diabet)
diabet
class(BMI)
table(Outcome)
count<-table(Outcome)
barplot(count,col=2)
pie(count)
table(Pregnancies)
count<-table(Pregnancies)
barplot(count)
pie(count)

output:
Conclusion:

Here we can see the BMI feature of our dataset showing a symmetric distribution, as the histogram produces a symmetric curve over its bars. Since the points are well scattered on the plot, we can conclude that BMI has a symmetric distribution.

There are more diabetic people in the age range 30-50 and fewer diabetic patients in the age group 20-25, so we conclude that age matters for diabetes.

Pregnancies

Diabetics
Non-diabetics

Here we can see that the chance of getting diabetes increases only slightly with the number of pregnancies.

Blood pressure ranges from 60-90 in diabetic patients.

Blood pressure ranges from 60-90 in non-diabetic patients.

Glucose in diabetics

Glucose in non-diabetics

Here we can observe that the chance of getting diabetes increases with an increase in glucose level.
This graph gives the number of diabetic and non-diabetic persons.
This graph gives the women and the number of pregnancies they had.

Case2: car acceleration and mpg analysis


Code:
data <- read.csv('D:/college/sem_6/data science/code/auto-mpg.csv')
head(data)
str(data)
summary(data)
library(dplyr)
newdata1<-subset(data,data$mpg <=40)
newdata1<-subset(newdata1,newdata1$acceleration <=21
&newdata1$acceleration >=9)
newdata2<-subset(newdata1,newdata1$mpg <=35)
newdata2
hist(newdata2$weight,col='RED')
boxplot(data$mpg)
boxplot(newdata1$mpg)
boxplot(data$acceleration)
boxplot(newdata1$acceleration)
boxplot(data$weight)
boxplot(newdata1$mpg)
newdata2<-subset(data,data$cylinders==8
&data$modelyear==70)
newdata1
names(data)
colSums(is.na(data))
hist(data$mpg,col='RED')
plot(data$acceleration)
boxplot(data$acceleration)
mean(data$acceleration)
median(data$acceleration)
max(data$acceleration)
min(data$acceleration)

attach(newdata1)
class(cylinders)
table(newdata1$modelyear)
count<-table(newdata1$modelyear)
barplot(count,col=2)
pie(count)
table(newdata1$origin)
count<-table(newdata1$origin)
barplot(count)
pie(count)

Output:
Boxplot acceleration

Boxplot mpg
Box plot of weight

After removal of outlier


Boxplot acceleration
Boxplot mpg

Conclusion:
Acceleration shows a symmetric data distribution.
For the acceleration range 14-16, cars give maximum mileage.

For the weight range 2000-3000, cars give maximum mileage.

Most cars are produced by country 1.
Practical No: 4
Title: Decision Tree
Roll No: 4808

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
A decision tree is a flowchart-like structure in which each internal
node represents a “test” on an attribute (e.g. whether a coin flip
comes up heads or tails), each branch represents the outcome of
the test, and each leaf node represents a class label (decision
taken after computing all attributes). The paths from root to leaf
represent classification rules.
Tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods. Tree-based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They are adaptable to solving any kind of problem at hand (classification or regression). Decision tree algorithms are referred to as CART (Classification and Regression Trees).

Common terms used with decision trees:

● Root Node: It represents the entire population or sample, and it further gets divided into two or more homogeneous sets.
● Splitting: The process of dividing a node into two or more sub-nodes.
● Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
● Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
● Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting.
● Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
● Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
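
Before the heart-disease program below, a minimal sketch on R's built-in iris data (used here only for illustration, not part of the practical's dataset) shows these terms in practice:

# a minimal sketch on the built-in iris data (illustration only)
library(rpart)
library(rpart.plot)
fit <- rpart(Species ~ ., data = iris, method = "class")  # the root node splits the full sample
rpart.plot(fit)        # internal nodes are decision nodes, terminal nodes are leaves
printcp(fit)           # complexity table that can guide pruning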

Heart Disease dataset


Code:

heart<-read.csv('D:/college/sem_6/data science/code/heart.csv')
summary(heart)
names(heart)
library(partykit)
heart$target<-as.factor(heart$target)#convert to categorical
summary(heart$target)
names(heart)
set.seed(1234)
pd <- sample(2, nrow(heart), replace = TRUE, prob = c(0.8, 0.2))  # two samples with distribution 0.8 and 0.2
trainingset<-heart[pd==1,]#first partition
validationset<-heart[pd==2,]#second partition
tree<-ctree(formula = target ~ age + sex + cp + trestbps + chol + fbs +
restecg + thalach + exang + oldpeak + slope + ca +
thal ,data=trainingset)
class(heart$target)
plot(tree)
# Pruning
tree<-ctree(formula = target ~ age + sex + cp + trestbps + chol + fbs +
restecg + thalach + exang + oldpeak + slope + ca +
thal ,data=trainingset,control=ctree_control(mincriterion =
0.99,minsplit = 500))
plot(tree)
pred<-predict(tree,validationset,type="prob")
pred
pred<-predict(tree,validationset)
pred
library(caret)
confusionMatrix(pred,validationset$target)
pred<-predict(tree,validationset,type="prob")
pred
library(pROC)
plot(roc(validationset$target,pred[ ,2]))
library(rpart)
fit <- rpart(target ~ age + sex + cp + trestbps + chol + fbs + restecg +
thalach + exang + oldpeak + slope + ca +
thal, data = heart, method = "class")
plot(fit)
text(fit)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(fit)
Prediction <- predict(fit, heart, type = "class")
Prediction

Output:
Conclusion:
● Accuracy: 73.36%
● Green color in the tree = false for heart disease
● Blue color in the tree = true for heart disease
● Darker shades mean more true/false for heart disease
● The ROC graph shows that the model is not very accurate, as sensitivity and specificity are almost the same.
● The true positive rate is not high enough, so accuracy is medium.
● Specificity and sensitivity should both be greater than 80% for proper accuracy, but only sensitivity is greater than 80%.
partykit: A Toolkit for Recursive Partitioning
A toolkit with infrastructure for representing, summarizing, and
visualizing tree-structured regression and classification models.
This unified infrastructure can be used for reading/coercing tree models from different sources ('rpart', 'RWeka', 'PMML'), yielding objects that share functionality for print()/plot()/predict() methods.
Caret:
The caret package (short for Classification and Regression
Training) contains functions to streamline the model training
process for complex regression and classification problems.
pROC
pROC is a set of tools to visualize, smooth and compare receiver operating characteristic (ROC) curves. (Partial) area under the curve (AUC) can be compared with statistical tests based on U-statistics or bootstrap. Confidence intervals can be computed for (p)AUC or ROC curves.

Rattle
A package written in R providing a graphical user interface to many other R packages that provide functionality for data mining.
data.table
data.table is an extension of the data.frame package in R. It is widely used for fast aggregation of large datasets, low-latency add/update/remove of columns, quicker ordered joins, and a fast file reader.
rpart.plot
This function combines and extends plot.rpart and text.rpart in the rpart package. It automatically scales and adjusts the displayed tree for best fit.
Practical No: 5
Title: Clustering
Roll No: 4808

1. Import dataset
data <- read.csv('C:/kunal_Ganjale_TY_4808/DS/code/coordinate.csv')
data<-data[1:150,]
names(data)

2. Making subset containing x feature
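
No code appears for this step in the original, so a minimal sketch is given below. It assumes the coordinate file has columns named X and y (the names referenced by the later hierarchical-clustering code) and keeps both, since kmeans() and clusplot() below operate on new_data as a two-column object:

# a minimal sketch, assuming coordinate columns X and y exist in the data (names assumed)
new_data <- data[, c("X", "y")]   # keep the coordinate features used for clustering
head(new_data)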

3. Checking outliers using boxplot


For X
For Y

4. Calculating K-means to make 2 clusters


cl<-kmeans(new_data, 2)
cl
5. Calculate WSS
data<-new_data
wss <- sapply(1:15, function(k) kmeans(data, k)$tot.withinss)
wss

6. Plot elbow graph


plot(1:15, wss, type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters", ylab = "Total within-clusters sum of squares")
7. Silhouette graph
library(factoextra)
library(cluster)  # provides pam() used by fviz_nbclust()
fviz_nbclust(data, pam, method = "silhouette")
8. Plot clusters
library(cluster)
clusplot(new_data, cl$cluster, color=TRUE, shade=TRUE,
labels=FALSE, lines=0)
9. Classification of points based on cluster
cl$cluster
cl$centers

10. Hierarchical clustering based on y feature


clusters <- hclust(dist(data[, 0:1]), method = 'average')
clustercut1 <- cutree(clusters, 2)
table(clustercut1, data$y)
plot(clusters)

library(ggplot2)
ggplot(data, aes(X, y)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clustercut1) +
scale_color_manual(values = c("red", "green", "black", "blue"))
11. Hierarchical clustering based on x feature
clusters <- hclust(dist(data[, 0:1]))
plot(clusters)
clustercut <- cutree(clusters, 2)
table(clustercut, data$X)

library(ggplot2)
ggplot(data, aes(X,y)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col =
clustercut) +
scale_color_manual (values = c("red", "green","black","blue"))
12. DBSCAN clustering
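No code appears for this step in the original, so a minimal sketch using the dbscan package is given below (the eps and minPts values are assumptions and would normally be tuned, for example from the knee of the k-nearest-neighbour distance plot):

# a minimal sketch of DBSCAN on the same coordinate data (parameter values are assumptions)
library(dbscan)
kNNdistplot(new_data, k = 4)                   # the knee of this curve suggests an eps value
db <- dbscan(new_data, eps = 0.5, minPts = 5)  # density-based clustering
db$cluster                                     # cluster labels; 0 marks noise points
plot(new_data, col = db$cluster + 1L)          # visualize the clusters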
Conclusion:
As we can see from the box plots of the x and y features, no outliers are present, so no outlier treatment of the features is needed.
Both the elbow method and the silhouette method give 2 as the optimal number of clusters, so we form two clusters.
We construct clusters using the k-means, hierarchical and DBSCAN clustering methods, but the k-means clusters show a better representation of the data than the other two.
It is found that the clusters are formed using the x feature as the prime aspect. As we can see in the figure below, the second cluster forms as the x value increases; the y feature in both clusters contains high values, but the first cluster has lower x values than the second cluster. Thus, in this case, the x feature is the prime aspect for clustering.
Practical No:6
Title: Association
Roll No:4808

Association:
Association is a data mining technique that discovers the
probability of the co-occurrence of items in a collection. The
relationships between co-occurring items are expressed as
Association Rules. Association rule mining finds interesting
associations and relationships among large sets of data items.
Association rules are "if-then" statements that help to show the probability of relationships between data items within large data sets in various types of databases. The "if" element is called the antecedent, and the "then" element is called the consequent. For example, the rule {bread} => {butter} states that customers who buy bread are also likely to buy butter. A relationship in which we find an association between two single items is known as single cardinality. Association rule mining has a number of applications and is widely used to help discover sales correlations in transactional data or in medical data sets.

Apriori:
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative, level-wise search in which k-itemsets are used to find (k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.
Apriori property – every non-empty subset of a frequent itemset must also be frequent.
Limitations of Apriori Algorithm
The Apriori algorithm can be slow.
The main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, low minimum support or large itemsets, i.e. it is not an efficient approach for a large number of datasets. It checks many sets from the candidate itemsets, and it scans the database repeatedly to find candidate itemsets. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
Algorithm
● Calculate the support of itemsets (of size k = 1) in the transactional database (note that support is the frequency of occurrence of an itemset). This is called generating the candidate set.
● Prune the candidate set by eliminating items with a support less than the given threshold.
● Join the frequent itemsets to form sets of size k + 1, and repeat the above steps until no more itemsets can be formed. This happens when the set(s) formed have a support less than the given support.
OR
1. Set a minimum support and confidence.
2. Take all the subsets present in the transactions which have higher support than the minimum support.
3. Take all the rules of these subsets which have higher confidence than the minimum confidence.
4. Sort the rules by decreasing lift.
Components of Apriori
Support and Confidence:
Support refers to an item's frequency of occurrence, i.e. how often items x and y are purchased together; confidence is the conditional probability that item y is purchased given that item x is purchased.
Support(I) = (Number of transactions containing item I) / (Total number of transactions)
Confidence(I1 -> I2) = (Number of transactions containing I1 and I2) / (Number of transactions containing I1)

Lift:
Lift gives the correlation between A and B in the rule A => B. Correlation shows how one itemset A affects the other itemset B.
If the rule has a lift of 1, then A and B are independent and no rule can be derived from them.
If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
If the lift is < 1, then the presence of A has a negative effect on B.
Lift(I1 -> I2) = Confidence(I1 -> I2) / Support(I2)
Coverage:
Coverage (also called cover or LHS-support) is the support of the left-hand side of the rule X => Y, i.e. supp(X).
It represents a measure of how often the rule can be applied.
Coverage can be quickly calculated from the rule's quality measures (support and confidence).
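
A small worked example of these measures in R (the transaction counts are hypothetical, chosen only to illustrate the formulas above):

# hypothetical counts: 100 transactions, 20 contain bread, 15 contain butter, 10 contain both
n_total  <- 100   # total transactions
n_bread  <- 20    # transactions containing bread
n_butter <- 15    # transactions containing butter
n_both   <- 10    # transactions containing both bread and butter

support_bread   <- n_bread / n_total                 # 0.20
support_butter  <- n_butter / n_total                # 0.15
confidence_rule <- n_both / n_bread                  # confidence(bread -> butter) = 0.50
lift_rule       <- confidence_rule / support_butter  # 0.50 / 0.15 ≈ 3.33, > 1 so bread and butter are dependent
coverage_rule   <- support_bread                     # coverage(bread -> butter) = supp(bread) = 0.20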
Fp tree:
The FP-Growth algorithm, proposed by Han, is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the use of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information. This tree-like structure is built from the initial itemsets of the database. The purpose of the FP-tree is to mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Algorithm:
Building the tree

Find Patterns Having p From P-conditional Database

Calculate conditional frequent pattern tree.


Dataset: supermarket.csv

1. Import libraries
library(arules)
library(arulesViz)
library(RColorBrewer)

2. Import dataset
data <- read.transactions('D:/college/sem_6/data science/code/supermarket.csv', rm.duplicates = TRUE, format = "single", sep = ",", header = TRUE, cols = c("Branch", "Product line"))
# data <- read.csv('C:/kunal_Ganjale_TY_4808/DS/code/Super Store.csv')
# data <- subset(data, select = c(0,1))

3. Display structure of data


str(data)
4. Items and transaction ids
inspect(head(data))

5. Labels of items
data@itemInfo$labels

6. Generating rules
data_rules <- apriori(data, parameter = list(supp = 0.01, conf =
0.2))
data_rules
7. Inspect rules
inspect(data_rules[1:20])

8. Inspect top 10 rules


inspect(head(sort(data_rules, by = "confidence"), 10))

9. Inspect bottom 10 rules


inspect(tail(sort(data_rules, by = "confidence"), 10))
10. Determine rules which lead to Fashion accessories
fashion_rules <- apriori(data = data, parameter = list(supp = 0.001, conf = 0.08), appearance = list(rhs = "Fashion accessories"))

inspect(head(sort(fashion_rules, by = "confidence"), 10))


11. Determine rules which lead to Fashion accessories with increased support
fashion_rules_increased_support <- apriori(data, parameter =
list(support =0.02, confidence = 0.5))

inspect(head(sort(fashion_rules_increased_support, by =
"confidence"), 10))

12. Plot absolute item frequency graph


itemFrequencyPlot(data, topN = 20, type = "absolute", col = brewer.pal(8, 'Pastel2'), main = "Absolute Item Frequency Plot")
Practical No: 7
Title: Time Series
Roll No:4808

1. Import CSV file


data <- read.csv('C:/kunal_Ganjale_TY_4808/DS/code/income1.csv')
attach(data)
head(data)

2. Assign x variable for month and y for total filled jobs


x<-Month
y<-Total.Filled.Jobs
3. Calculate differences between consecutive elements of the vector
d.y<-diff(y)

4. Calculate and plot acf


acf(y)
5. Calculate and plot pacf
pacf(y)

6. Calculate and plot acf of d.y


acf(d.y)
7. Generate arima model for y variable based on month
arima(y,order = c(1,0,0))

8. Fit an ARIMA(0,0,1) model and store it
mydata.arima001 <- arima(y, order = c(0,0,1))
9. Predict the next 100 values using the ARIMA model
mydata.pred01<-predict(mydata.arima001,n.ahead = 100)
head(mydata.pred01)

10. Plot y values


plot(y)
11. Display head and tail of predicted values from
prediction table
tail(mydata.pred01$pred)
head(mydata.pred01$pred)
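
The conclusion below refers to the predicted trend shown as a blue line; the original plotting code is not shown, so the following is a minimal sketch of one way to produce such a plot:

# a minimal sketch: plot the observed series and the 100 predicted values as a blue line
ts.plot(ts(y), mydata.pred01$pred, col = c("black", "blue"),
        xlab = "Time", ylab = "Total Filled Jobs")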

Conclusion:
As we can see from the trend of the y variable shown as a blue line, the model predicts values that follow the trend.
Practical No:8
Title: MongoDB
Roll No:4808

1. Extract the MongoDB zip file to the C:\ drive
2. Create folder data in C:\
3. Create folder db in C:\data\db
4. Go to C:\mongodb\bin, run mongod.exe and keep the server running
Click on mongo.exe
Create database kunal (use kunal)

Create table student and insert records


> db.student.insert({name:"kunal",age:22,address:
[{city:"mumbai"},{pin:400614}]})
WriteResult({ "nInserted" : 1 })
> db.student.insert({name:"Sajjad",age:22,address:
[{city:"Dombivli"},{pin:401614}]})
WriteResult({ "nInserted" : 1 })
> db.student.insert({name:"Pankaj",age:21,address:[{city:"Pune"},
{pin:406721}]})
WriteResult({ "nInserted" : 1 })
db.student.insert({name:"Akshay",age:24,address:[{city:"Pune"},
{pin:456765}]})
db.student.insert({name:"Yash",age:21,address:[{city:"Satara"},
{pin:345234}]})

Display inserted records


Create table student_mark and insert records in it
> db.student_mark.insert({name:"kunal",marks:[{physics:79},
{chem:89},{bio:87}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Sajjad",marks:[{physics:90},
{chem:79},{bio:84}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Pankaj",marks:[{physics:76},
{chem:89},{bio:67}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Akshay",marks:[{physics:63},
{chem:78},{bio:88}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Yash",marks:[{physics:71},
{chem:55},{bio:65}]})
WriteResult({ "nInserted" : 1 })
Display records in JSON format
db.student.find().forEach(printjson)

Display details of students whose age is greater than 22

> db.student.find({age:{$gt:22}}).pretty()
Display details of students whose city is Pune

db.student.find({'address.city':'Pune'}).pretty()
Display students who got more than 84 marks in physics

Display students who got 84 or fewer marks in bio

db.student_mark.find({'marks.bio':{$lte:84}}).pretty()
Display students who live in Pune or Mumbai and whose age is at least 21

db.student.find({'address.city':{$in:["Pune","mumbai"]},age:
{$gte:21}}).pretty()
Display students who got at least 70 marks in all subjects
db.student_mark.find({'marks.bio':{$gte:70},'marks.physics':
{$gte:70},'marks.chem':{$gte:70}}).pretty()

Display student who got 84 marks in bio


db.student_mark.find({'marks.bio':84}).pretty()

Update the name of the student who got 84 marks in bio to Anurag
db.student_mark.update({'marks.bio':84},{$set:
{'name':'anurag'}})
Remove the student whose name is anurag
db.student_mark.remove({'name':'anurag'},1)

Delete collection
db.student_mark.drop()

Drop the database
db.dropDatabase()
Practical No:9
Title: Topic Modelling
Roll No:4808

Load all the text files from folder


library(tm)
library(topicmodels)
setwd("C:/british-fiction-corpus")
filenames <- list.files(path = "C:/british-fiction-corpus", pattern = "*.txt")
filenames

Read all the words from the text files and build a corpus


filetext <- lapply(filenames, readLines)  # lapply returns a list of the same length as X, applying FUN to the corresponding element of X
mycorpus <- Corpus(VectorSource(filetext))  # VectorSource interprets each element of the vector x as a document
mycorpus<-tm_map(mycorpus,removeNumbers)
mycorpus<-tm_map(mycorpus,removePunctuation)
mycorpus

Provide a list of stopwords to find in each text file and remove them from the words in the files
mystopwords = c("of","a","and","the","in","to","for","that","is","on","are","with","as","by","be","an","which","it","from","or","can","have","these","has","such","you")
mycorpus<-tm_map(mycorpus,tolower)
mycorpus<-tm_map(mycorpus,removeWords,mystopwords)
dtm<-DocumentTermMatrix(mycorpus)
k<-3

#lda_output_3<-LDA(dtm,k,method="VEM",control=control_VEM)
# control_VEM

#lda_output_3<-LDA(dtm,k,method="VEM",control=NULL)
lda_output_3<-LDA(dtm,k,method="VEM")

#lda_output_3<-LDA(dtm,k,method="VEM")
#lda_output_3@Dim
#lda_output_3<-LDA(dtm,k,method="VEM")
#show (dtm)

topics(lda_output_3)
terms(lda_output_3,10)
Output:

Conclusion:
The keywords extracted for each topic from the text files summarize their content and are suitable for further natural language processing (NLP).
