Lab 3 Data Mining NoSQl - Harshil - Parmar

This document provides a comprehensive guide on using MongoDB with Python for non-relational data mining. It covers the installation of MongoDB, the use of the PyMongo driver, and various data manipulation techniques such as inserting, retrieving, filtering, and sorting documents within a MongoDB collection. Additionally, it includes reflection tasks for further practice with data deletion and updating operations.

Objective

Apply non-relational data mining techniques in Python using MongoDB

Practice NoSQL data mining techniques and reflect

What is MongoDB?
MongoDB is a document database that can be installed on a local machine or hosted in the
cloud. The cloud-hosted flavour of MongoDB is called MongoDB Atlas.
It stores JSON-like documents, providing flexibility and scalability.
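
To make "JSON-like" concrete, here is a minimal sketch that uses only Python's standard json module and no MongoDB server. The two dictionaries have the same shape as the documents inserted later in this lab; the variable names are illustrative assumptions.

```python
import json

# Two documents that could live in the same collection even though
# their fields differ; a document database does not enforce one schema.
doc_a = {"name": "Shweta", "Course": "Info 6150", "classsize": 40}
doc_b = {"_id": 100, "name": "John", "Course": "Info 6150", "grade": 30}

# Each document maps naturally onto a Python dict and onto JSON text.
as_json = json.dumps(doc_b)
round_trip = json.loads(as_json)

# Fields present in one document but not in the other.
only_in_a = sorted(set(doc_a) - set(doc_b))
only_in_b = sorted(set(doc_b) - set(doc_a))

print(as_json)
print(only_in_a, only_in_b)
```

The point of the comparison at the end is that neither document is "missing" anything: each one simply carries the fields it needs.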

For this lab we will download the MongoDB Community Server from
https://www.mongodb.com/try/download/community

PyMongo
PyMongo is the Python driver used to access a MongoDB database from Python code

!pip install pymongo

Collecting pymongo
Downloading pymongo-4.11-cp313-cp313-win_amd64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.11-cp313-cp313-win_amd64.whl (932 kB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.11

import pymongo as pm

Create a sample database if it does not already exist

conn = pm.MongoClient("mongodb://localhost:27017/")  # create a connection to the database

db = conn["firstmongo"]  # create a database named "firstmongo"


Create a collection "class" with corresponding field values and their
data types
Note: a "table" in MongoDB is called a "collection"

1. Create a collection by specifying its name (if it does not already exist).
2. In MongoDB a collection is not created until it has content.
Therefore this step leaves no blank collections (tables) in the database.
3. We will insert a single document (the equivalent of a record in a SQL table) using the insert_one()
method.
mycollection = db["class"]

mydict = { "name": "Shweta", "Course": "Info 6150" , "classsize":40}

x = mycollection.insert_one(mydict)

# print the _id value of the inserted document
print(x.inserted_id)

67a28a5f5aabe18fcc1bf3e6
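
The value printed above is an ObjectId, which MongoDB generates when a document is inserted without an _id. An ObjectId is 12 bytes (24 hexadecimal characters), and its first 4 bytes hold the creation time in seconds since the Unix epoch. The sketch below decodes that timestamp with the standard library only; the helper name objectid_timestamp is hypothetical and not part of PyMongo.

```python
from datetime import datetime, timezone


def objectid_timestamp(hex_id: str) -> datetime:
    """Decode the creation time stored in the first 4 bytes of an ObjectId."""
    if len(hex_id) != 24:
        raise ValueError("an ObjectId is 24 hexadecimal characters")
    seconds = int(hex_id[:8], 16)  # first 8 hex characters = 4 bytes
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = objectid_timestamp("67a28a5f5aabe18fcc1bf3e6")
print(created.isoformat())
```

With PyMongo installed, the ObjectId class exposes the same information through its generation_time attribute, so this helper is only meant to show what the identifier encodes.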

#### Let's add more human-readable unique ids


mylist = [
{ "_id": 100, "name": "John", "Course": "Info 6150" , "grade":30},
{ "_id": 200, "name": "Peter", "Course": "Info 6150" , "grade":40},
{ "_id": 300, "name": "Amy", "Course": "Info 6150" , "grade":50},
{ "_id": 400, "name": "Hannah", "Course": "Info 6150" , "grade":60},
{ "_id": 500, "name": "Michael", "Course": "Info 6150" ,
"grade":70},
{ "_id": 600, "name": "Sandy", "Course": "Info 6150" , "grade":80},
{ "_id": 700, "name": "Betty", "Course": "Info 6150" , "grade":80},
{ "_id": 800, "name": "Richard", "Course": "Info 6150" ,
"grade":70},
{ "_id": 900, "name": "Susan", "Course": "Info 6150" , "grade":60},
{ "_id": 1000, "name": "Vicky", "Course": "Info 6150" , "grade":50},
{ "_id": 1100, "name": "Ben", "Course": "Info 6150" , "grade":85},
{ "_id": 1200, "name": "William", "Course": "Info 6150" ,
"grade":75},
{ "_id": 1300, "name": "Chuck", "Course": "Info 6150" , "grade":65},
{ "_id": 1400, "name": "Viola", "Course": "Info 6150" , "grade":55}
]

x = mycollection.insert_many(mylist)
print(x.inserted_ids)

[100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300,
1400]

You will agree that this is a more easily understood unique id


Retrieve data from the collection you just created
### Find One method
one = mycollection.find_one()
print(one)

### Find All method
for doc in mycollection.find():
    print(doc)

{'_id': ObjectId('67a28a5f5aabe18fcc1bf3e6'), 'name': 'Shweta',
'Course': 'Info 6150', 'classsize': 40}
{'_id': ObjectId('67a28a5f5aabe18fcc1bf3e6'), 'name': 'Shweta',
'Course': 'Info 6150', 'classsize': 40}
{'_id': 100, 'name': 'John', 'Course': 'Info 6150', 'grade': 30}
{'_id': 200, 'name': 'Peter', 'Course': 'Info 6150', 'grade': 40}
{'_id': 300, 'name': 'Amy', 'Course': 'Info 6150', 'grade': 50}
{'_id': 400, 'name': 'Hannah', 'Course': 'Info 6150', 'grade': 60}
{'_id': 500, 'name': 'Michael', 'Course': 'Info 6150', 'grade': 70}
{'_id': 600, 'name': 'Sandy', 'Course': 'Info 6150', 'grade': 80}
{'_id': 700, 'name': 'Betty', 'Course': 'Info 6150', 'grade': 80}
{'_id': 800, 'name': 'Richard', 'Course': 'Info 6150', 'grade': 70}
{'_id': 900, 'name': 'Susan', 'Course': 'Info 6150', 'grade': 60}
{'_id': 1000, 'name': 'Vicky', 'Course': 'Info 6150', 'grade': 50}
{'_id': 1100, 'name': 'Ben', 'Course': 'Info 6150', 'grade': 85}
{'_id': 1200, 'name': 'William', 'Course': 'Info 6150', 'grade': 75}
{'_id': 1300, 'name': 'Chuck', 'Course': 'Info 6150', 'grade': 65}
{'_id': 1400, 'name': 'Viola', 'Course': 'Info 6150', 'grade': 55}

Return only specific fields of each document rather than every field

Filter and Select

for doc in mycollection.find({}, {"_id": 0, "name": 1, "grade": 1}):
    print(doc)

{'name': 'Shweta'}
{'name': 'John', 'grade': 30}
{'name': 'Peter', 'grade': 40}
{'name': 'Amy', 'grade': 50}
{'name': 'Hannah', 'grade': 60}
{'name': 'Michael', 'grade': 70}
{'name': 'Sandy', 'grade': 80}
{'name': 'Betty', 'grade': 80}
{'name': 'Richard', 'grade': 70}
{'name': 'Susan', 'grade': 60}
{'name': 'Vicky', 'grade': 50}
{'name': 'Ben', 'grade': 85}
{'name': 'William', 'grade': 75}
{'name': 'Chuck', 'grade': 65}
{'name': 'Viola', 'grade': 55}

### Print everything but exclude grades
for all_but in mycollection.find({}, {"grade": 0}):
    print(all_but)

{'_id': ObjectId('67a28a5f5aabe18fcc1bf3e6'), 'name': 'Shweta',
'Course': 'Info 6150', 'classsize': 40}
{'_id': 100, 'name': 'John', 'Course': 'Info 6150'}
{'_id': 200, 'name': 'Peter', 'Course': 'Info 6150'}
{'_id': 300, 'name': 'Amy', 'Course': 'Info 6150'}
{'_id': 400, 'name': 'Hannah', 'Course': 'Info 6150'}
{'_id': 500, 'name': 'Michael', 'Course': 'Info 6150'}
{'_id': 600, 'name': 'Sandy', 'Course': 'Info 6150'}
{'_id': 700, 'name': 'Betty', 'Course': 'Info 6150'}
{'_id': 800, 'name': 'Richard', 'Course': 'Info 6150'}
{'_id': 900, 'name': 'Susan', 'Course': 'Info 6150'}
{'_id': 1000, 'name': 'Vicky', 'Course': 'Info 6150'}
{'_id': 1100, 'name': 'Ben', 'Course': 'Info 6150'}
{'_id': 1200, 'name': 'William', 'Course': 'Info 6150'}
{'_id': 1300, 'name': 'Chuck', 'Course': 'Info 6150'}
{'_id': 1400, 'name': 'Viola', 'Course': 'Info 6150'}

In a projection such as the ones above, we cannot mix 0 (exclude) and 1 (include) for the returned fields.
The only exception is the primary key, _id, which may be set to 0 alongside included fields.

e.g. for x in mycollection.find({},{ "name": 1, "Course": 0 }): ## incorrect

for x in mycollection.find({},{ "name": 1, "_id": 0 }): ## correct
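
The projection rule above can be imitated on a plain dictionary. This is a minimal in-memory sketch, not PyMongo's implementation: the function name project and the ValueError it raises are assumptions made for illustration, and the real server reports its own error for a mixed projection.

```python
def project(doc: dict, projection: dict) -> dict:
    """Apply a MongoDB-style projection to one in-memory document (illustrative only)."""
    non_id = {k: bool(v) for k, v in projection.items() if k != "_id"}
    if len(set(non_id.values())) > 1:
        raise ValueError("cannot mix inclusion (1) and exclusion (0), except for _id")
    keep_id = bool(projection.get("_id", 1))  # _id is returned unless set to 0
    if non_id and all(non_id.values()):
        # Inclusion mode: keep only the listed fields (plus _id if wanted).
        return {k: v for k, v in doc.items()
                if k in non_id or (k == "_id" and keep_id)}
    # Exclusion mode: keep everything except the listed fields.
    return {k: v for k, v in doc.items()
            if k not in non_id and (k != "_id" or keep_id)}


john = {"_id": 100, "name": "John", "Course": "Info 6150", "grade": 30}
print(project(john, {"_id": 0, "name": 1, "grade": 1}))
print(project(john, {"grade": 0}))
try:
    project(john, {"name": 1, "Course": 0})
except ValueError as err:
    print("rejected:", err)
```

Treating _id separately from the other keys is what lets the "correct" example above combine "name": 1 with "_id": 0.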

# Filter by key value
doc = mycollection.find({"grade": 80})

for x in doc:
    print(x)

{'_id': 600, 'name': 'Sandy', 'Course': 'Info 6150', 'grade': 80}
{'_id': 700, 'name': 'Betty', 'Course': 'Info 6150', 'grade': 80}

# Filter by text: names that sort after the letter "V"
doc = mycollection.find({"name": {"$gt": "V"}})

for x in doc:
    print(x)

{'_id': 1000, 'name': 'Vicky', 'Course': 'Info 6150', 'grade': 50}
{'_id': 1200, 'name': 'William', 'Course': 'Info 6150', 'grade': 75}
{'_id': 1400, 'name': 'Viola', 'Course': 'Info 6150', 'grade': 55}
Insert Data into the table created above
### Filter names that start with an exact letter
doc = mycollection.find({"name": {"$regex": "^V"}})

for x in doc:
    print(x)

{'_id': 1000, 'name': 'Vicky', 'Course': 'Info 6150', 'grade': 50}
{'_id': 1400, 'name': 'Viola', 'Course': 'Info 6150', 'grade': 55}
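
The difference between the two name filters above is easier to see in plain Python. This sketch assumes MongoDB's default string ordering (no collation configured) behaves like Python's string comparison for these ASCII names; it runs on a list of the names from this lab and does not touch the database.

```python
import re

names = ["Shweta", "John", "Peter", "Amy", "Hannah", "Michael", "Sandy",
         "Betty", "Richard", "Susan", "Vicky", "Ben", "William", "Chuck", "Viola"]

# Analogue of {"name": {"$gt": "V"}}: plain string ordering, so anything
# that sorts after "V" qualifies, including names starting with "W".
greater_than_v = [n for n in names if n > "V"]

# Analogue of {"name": {"$regex": "^V"}}: the name must start with "V".
starts_with_v = [n for n in names if re.search(r"^V", n)]

print(greater_than_v)
print(starts_with_v)
```

The ordering filter lets "William" through because "W" sorts after "V", while the anchored regular expression keeps only names beginning with "V".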

Sorting

mydoc = mycollection.find().sort("name")

for x in mydoc:
    print(x)

{'_id': 300, 'name': 'Amy', 'Course': 'Info 6150', 'grade': 50}
{'_id': 1100, 'name': 'Ben', 'Course': 'Info 6150', 'grade': 85}
{'_id': 700, 'name': 'Betty', 'Course': 'Info 6150', 'grade': 80}
{'_id': 1300, 'name': 'Chuck', 'Course': 'Info 6150', 'grade': 65}
{'_id': 400, 'name': 'Hannah', 'Course': 'Info 6150', 'grade': 60}
{'_id': 100, 'name': 'John', 'Course': 'Info 6150', 'grade': 30}
{'_id': 500, 'name': 'Michael', 'Course': 'Info 6150', 'grade': 70}
{'_id': 200, 'name': 'Peter', 'Course': 'Info 6150', 'grade': 40}
{'_id': 800, 'name': 'Richard', 'Course': 'Info 6150', 'grade': 70}
{'_id': 600, 'name': 'Sandy', 'Course': 'Info 6150', 'grade': 80}
{'_id': ObjectId('67a28a5f5aabe18fcc1bf3e6'), 'name': 'Shweta',
'Course': 'Info 6150', 'classsize': 40}
{'_id': 900, 'name': 'Susan', 'Course': 'Info 6150', 'grade': 60}
{'_id': 1000, 'name': 'Vicky', 'Course': 'Info 6150', 'grade': 50}
{'_id': 1400, 'name': 'Viola', 'Course': 'Info 6150', 'grade': 55}
{'_id': 1200, 'name': 'William', 'Course': 'Info 6150', 'grade': 75}

To return only a few documents, set a limit

my5 = mycollection.find().limit(5)

# print the result:
for x in my5:
    print(x)

{'_id': ObjectId('67a28a5f5aabe18fcc1bf3e6'), 'name': 'Shweta',
'Course': 'Info 6150', 'classsize': 40}
{'_id': 100, 'name': 'John', 'Course': 'Info 6150', 'grade': 30}
{'_id': 200, 'name': 'Peter', 'Course': 'Info 6150', 'grade': 40}
{'_id': 300, 'name': 'Amy', 'Course': 'Info 6150', 'grade': 50}
{'_id': 400, 'name': 'Hannah', 'Course': 'Info 6150', 'grade': 60}
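
As a rough mental model of the sort and limit steps above, the same operations can be written against an in-memory list. The documents below are a subset of the lab data with the Course field omitted for brevity; sorted() and slicing stand in for the cursor's sort() and limit() and are only an analogy, not how the server executes the query.

```python
from operator import itemgetter

docs = [
    {"_id": 100, "name": "John", "grade": 30},
    {"_id": 200, "name": "Peter", "grade": 40},
    {"_id": 300, "name": "Amy", "grade": 50},
    {"_id": 400, "name": "Hannah", "grade": 60},
]

# Analogue of find().sort("name", 1): ascending by name.
ascending = sorted(docs, key=itemgetter("name"))

# Analogue of find().sort("name", -1): descending by name.
descending = sorted(docs, key=itemgetter("name"), reverse=True)

# Analogue of adding .limit(2): keep only the first two results.
top_two = descending[:2]

print([d["name"] for d in ascending])
print([d["name"] for d in top_two])
```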
Reflection Task (5 points)
Add additional code cells and perform the tasks suggested below

1. Using the logic above, implement sorting in ascending and descending order. Hint:
sort("name", 1) #ascending
sort("name", -1) #descending

2. Delete One where "name":"Ron"

mycollection.delete_one(______)

3. Delete Many and check for deletion where "name":{"$regex": "^V"}

print(x.deleted_count, " documents deleted.")

4. Delete all remaining documents in the collection

x = mycollection.delete_many({})

Additional Tip >> Drop and Update


Deleting every document, as in the last task above, empties the collection but does not remove the
collection itself. To remove the entire collection (i.e. the table), use the drop() method, much like
DROP TABLE in SQL

mycollection.drop()

Similarly, we can update one document or many documents using the same logic

oldvalues = {"name": "Shweta"}
newvalues = {"$set": {"name": "Shwetz"}}
mycollection.update_one(oldvalues, newvalues)

mycollection.update_many(oldvalues, newvalues)
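
The update calls take two arguments: a filter that selects documents and an update document whose $set operator maps field names to new values. The sketch below imitates that behaviour on a list of dictionaries to show the difference between updating one match and updating all matches. The helper names matches and apply_set are hypothetical, the matching handles equality only, and the two identical sample documents are made up for illustration.

```python
def matches(doc: dict, query: dict) -> bool:
    """Equality-only matching: every query field must equal the document's value."""
    return all(doc.get(k) == v for k, v in query.items())


def apply_set(docs: list, query: dict, update: dict, many: bool = False) -> int:
    """Imitate update_one / update_many for the $set operator on a list of dicts."""
    changed = 0
    for doc in docs:
        if matches(doc, query):
            doc.update(update["$set"])  # $set overwrites or adds the given fields
            changed += 1
            if not many:
                break  # update_one stops after the first match
    return changed


docs = [
    {"name": "Shweta", "Course": "Info 6150"},
    {"name": "Shweta", "Course": "Info 6150"},
]
oldvalues = {"name": "Shweta"}
newvalues = {"$set": {"name": "Shwetz"}}

first_only = apply_set(docs, oldvalues, newvalues)           # like update_one
remaining = apply_set(docs, oldvalues, newvalues, many=True)  # like update_many
print(first_only, remaining, docs)
```

Note that $set expects a sub-document of field/value pairs, which is why the new value is written as {"name": "Shwetz"} rather than as a bare string.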

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
mycollection = db["class"]

# Insert sample document
mydict = { "name": "Shweta", "Course": "Info 6150" , "classsize": 40 }
x = mycollection.insert_one(mydict)

# Sorting documents
ascending_sort = mycollection.find().sort("name", 1)  # Ascending order
print("Ascending Sort:")
for doc in ascending_sort:
    print(doc)

descending_sort = mycollection.find().sort("name", -1)  # Descending order
print("Descending Sort:")
for doc in descending_sort:
    print(doc)

# Delete one document where name is "Ron"
mycollection.delete_one({"name": "Ron"})

# Delete many documents where name starts with "V"
x = mycollection.delete_many({"name": {"$regex": "^V"}})
print(x.deleted_count, "documents deleted.")

# Delete all remaining documents
x = mycollection.delete_many({})
print(x.deleted_count, "documents deleted.")

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Explore the dataset and display basic statistics
print("Titanic Dataset Info:")
print(titanic.info())
print("\nBasic Statistics:")
print(titanic.describe(include='all'))

# Box plot for age distribution by class
plt.figure(figsize=(8, 6))
sns.boxplot(x='class', y='age', data=titanic, palette='coolwarm')
plt.title('Age Distribution by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.show()

# Bar plot for male and female passenger count
plt.figure(figsize=(6, 4))
sns.countplot(x='sex', data=titanic, palette='pastel')
plt.title('Count of Male and Female Passengers')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.show()

# Load Iris dataset
iris = sns.load_dataset('iris')

# Summary statistics for sepal length by species
summary_stats = iris.groupby('species')['sepal_length'].agg(['mean',
'median', 'std'])
print("\nSummary Statistics for Sepal Length by Species:")
print(summary_stats)
# Scatter plot for petal length vs. petal width with size representing
# sepal length/sepal width ratio
plt.figure(figsize=(8, 6))
sns.scatterplot(x='petal_length', y='petal_width', hue='species',
                size=iris['sepal_length'] / iris['sepal_width'], palette='viridis',
                sizes=(20, 200), data=iris)
plt.title('Petal Length vs. Petal Width with Size Representing Sepal Ratio')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend(title='Species')
plt.show()

Ascending Sort:
{'_id': ObjectId('67a28b8c5aabe18fcc1bf3e8'), 'name': 'Shweta',
'Course': 'Info 6150', 'classsize': 40}
Descending Sort:
{'_id': ObjectId('67a28b8c5aabe18fcc1bf3e8'), 'name': 'Shweta',
'Course': 'Info 6150', 'classsize': 40}
0 documents deleted.
1 documents deleted.
Titanic Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
None

Basic Statistics:
          survived      pclass   sex         age       sibsp       parch
count   891.000000  891.000000   891  714.000000  891.000000  891.000000
unique         NaN         NaN     2         NaN         NaN         NaN
top            NaN         NaN  male         NaN         NaN         NaN
freq           NaN         NaN   577         NaN         NaN         NaN
mean      0.383838    2.308642   NaN   29.699118    0.523008    0.381594
std       0.486592    0.836071   NaN   14.526497    1.102743    0.806057
min       0.000000    1.000000   NaN    0.420000    0.000000    0.000000
25%       0.000000    2.000000   NaN   20.125000    0.000000    0.000000
50%       0.000000    3.000000   NaN   28.000000    0.000000    0.000000
75%       1.000000    3.000000   NaN   38.000000    1.000000    0.000000
max       1.000000    3.000000   NaN   80.000000    8.000000    6.000000

              fare embarked  class  who adult_male deck  embark_town alive
count   891.000000      889    891  891        891  203          889   891
unique         NaN        3      3    3          2    7            3     2
top            NaN        S  Third  man       True    C  Southampton    no
freq           NaN      644    491  537        537   59          644   549
mean     32.204208      NaN    NaN  NaN        NaN  NaN          NaN   NaN
std      49.693429      NaN    NaN  NaN        NaN  NaN          NaN   NaN
min       0.000000      NaN    NaN  NaN        NaN  NaN          NaN   NaN
25%       7.910400      NaN    NaN  NaN        NaN  NaN          NaN   NaN
50%      14.454200      NaN    NaN  NaN        NaN  NaN          NaN   NaN
75%      31.000000      NaN    NaN  NaN        NaN  NaN          NaN   NaN
max     512.329200      NaN    NaN  NaN        NaN  NaN          NaN   NaN

alone
count 891
unique 2
top True
freq 537
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN

C:\Users\Hp15d\AppData\Local\Temp\ipykernel_15440\3701347398.py:48:
FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be
removed in v0.14.0. Assign the `x` variable to `hue` and set
`legend=False` for the same effect.

sns.boxplot(x='class', y='age', data=titanic, palette='coolwarm')

C:\Users\Hp15d\AppData\Local\Temp\ipykernel_15440\3701347398.py:56:
FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be
removed in v0.14.0. Assign the `x` variable to `hue` and set
`legend=False` for the same effect.

sns.countplot(x='sex', data=titanic, palette='pastel')

Summary Statistics for Sepal Length by Species:
             mean  median       std
species
setosa      5.006     5.0  0.352490
versicolor  5.936     5.9  0.516171
virginica   6.588     6.5  0.635880
