W7 - MongoDB in Python (Me)
[1]: # If the code below was executed with no errors, "pymongo" is installed and ready to be used.
import pymongo
Before running the Python code for MongoDB, the MongoDB server (mongod) must be started.
To create a database in MongoDB, start by creating a MongoClient object with a connection URL
that contains the correct IP address, then specify the name of the database you want to create.
MongoDB creates the database if it does not exist and connects to it.
2 MongoDB structure
JSON <> Python
JSON (JavaScript Object Notation) is the basis of MongoDB’s data format.
JSON has two collection structures. Objects map string keys to values, and arrays order values.
JSON data types have equivalents in Python:
• JSON objects are like Python dictionaries with string-type keys.
• Arrays are like Python lists.
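As a small illustration of this mapping, the sketch below parses a JSON string with Python's standard json module. The JSON text is a made-up, heavily shortened record used only for this example.

```python
import json

# Hypothetical JSON text, shaped like a (much shortened) laureate record.
raw = '{"firstname": "Marie", "prizes": [{"year": "1903"}, {"year": "1911"}]}'

doc = json.loads(raw)

print(type(doc).__name__)            # a JSON object becomes a dict
print(type(doc["prizes"]).__name__)  # a JSON array becomes a list
print(doc["prizes"][1]["year"])      # nested access: key, then index, then key
```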
JSON <> Python <> MongoDB
• A MongoDB database maps names to collections. You can access collections by name the
same way you would access values in a Python dictionary.
• A collection, in turn, is like a list of dictionaries, called “documents” by MongoDB. When a
dictionary is a value within a document, that’s a subdocument.
• Values in a document can be any of the types I mentioned. MongoDB also supports some
types native to Python but not to JSON. Two examples are dates and regular expressions.
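The nesting described above can be pictured with plain Python containers. This is only an analogy, not how pymongo represents a database; the name nobel_like and the shortened document are assumptions made for the sketch.

```python
# Plain-Python stand-in for a database: collection names map to lists of documents.
nobel_like = {
    "laureates": [
        {
            "gender": "male",
            "prizes": [
                {
                    "year": "1902",
                    "category": "physics",
                    "affiliations": [{"name": "Leiden University", "city": "Leiden"}],
                }
            ],
        }
    ]
}

collection = nobel_like["laureates"]   # a collection: a list of documents
document = collection[0]               # a document: a dictionary
subdocument = document["prizes"][0]    # a dictionary nested inside a document
print(subdocument["affiliations"][0]["name"])
```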
Accessing databases and collections
There are two ways to access databases and collections from a client object:
• One way is square bracket notation, as if a client is a dictionary of databases, with database
names as keys. A database in turn is like a dictionary of collections, with collection names as keys.
• Another way is dot notation. Databases are attributes of a client, and collections are
attributes of a database.
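[2]: """
# The code below fetches the latest data from the Nobel Prize website and
# stores it in collections
import pymongo
import json
import requests
# Connect to local MongoDB server
client = pymongo.MongoClient("mongodb://localhost:27017/")
# Create a database 'nobel'
nobel = client["nobel"]
# Return a list of your system's databases:
# In MongoDB, a database is not created until it gets content.
# print(client.list_database_names())
for collection_name in ["prizes", "laureates"]:
    # collect the data from the API
    response = requests.get("http://api.nobelprize.org/v1/{}.json".\
        format(collection_name[:-1]))
    # convert the data to json
    documents = response.json()[collection_name]
    # Create collections on the fly
    nobel[collection_name].insert_many(documents)
# Show the number of documents in a collection
print(nobel["prizes"].count_documents({}))
print(nobel["laureates"].count_documents({}))
# delete the collections
nobel["prizes"].drop()
nobel["laureates"].drop()
"""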
[2]: '\n#The code below fetches real and latest data from Nobel Prize website, and\n#
store it into collections\n\nimport pymongo\nimport json\nimport requests\n\n#
Connnect to local MongoDB server\nclient =
pymongo.MongoClient("mongodb://localhost:27017/")\n\n# Create a database
\'nobel\'\nnobel = client["nobel"]\n\n# Return a list of your system\'s
databases:\n# In MongoDB, a database is not created until it gets content.\n#
print(client.list_database_names())\n\n\nfor collection_name in ["prizes",
"laureates"]:\n # collect the date from the API\n response =
requests.get("http://api.nobelprize.org/v1/{}.json".
format(collection_name[:-1]))\n # convert the data to json\n documents =
response.json()[collection_name]\n # Create collections on the fly\n
nobel[collection_name].insert_many(documents)\n\n# Show the number of document
in a collection\nprint(nobel["prizes"].count_documents({}))\nprint(nobel["laurea
tes"].count_documents({}))\n\n# delete the
collections\nnobel["prizes"].drop()\nnobel["laureates"].drop()\n'
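import json
from pymongo import MongoClient
# Connect to the local MongoDB server and select the 'nobel' database
# (same connection details as in the commented-out cell above)
client = MongoClient("mongodb://localhost:27017/")
nobel = client["nobel"]
# Get handles to the collections (a collection is created on its first insert)
prizes = nobel.prizes
laureates = nobel.laureates
# alternative ways to get the collections
# prizes = nobel["prizes"]
# laureates = nobel["laureates"]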
# check
print("List of collection names: ", nobel.list_collection_names())
print("\n")
# find_one() accepts an optional filter argument specifying the pattern that the document must match.
# You can specify no filter or an empty document filter ({}), in which case MongoDB will
# return the document that is first in the internal order of the collection.
prize = prizes.find_one()
laureate = laureates.find_one()
# delete collections
#prizes.drop()
#laureates.drop()
# delete database
#client.drop_database('nobel')
'diedCountry': 'the Netherlands', 'diedCountryCode': 'NL', 'gender': 'male',
'prizes': [{'year': '1902', 'category': 'physics', 'share': '2', 'motivation':
'"in recognition of the extraordinary service they rendered by their researches
into the influence of magnetism upon radiation phenomena"', 'affiliations':
[{'name': 'Leiden University', 'city': 'Leiden', 'country': 'the
Netherlands'}]}]}
[14]: # Create a filter for Germany-born laureates who died in the USA and with the first name "Albert"
# Save a filter for laureates who died in the USA and were not born there
criteria = {
    'diedCountry': 'USA',
    'bornCountry': {"$ne": 'USA'},
}
# Count them
count = nobel.laureates.count_documents(criteria)
print(count)
# Use dot notation. Filter for laureates born in Austria with non-Austria prize affiliation
# Use dot notation. Filter for laureates with at least three prizes
criteria = {"prizes.2": {"$exists": True}}
# Find one laureate with at least three prizes
doc = laureates.find_one(criteria)
# Print the document
print(doc)
1
291
69
10
0
{'_id': ObjectId('5fcce8b621f0b9f44a855e18'), 'id': '482', 'firstname': 'Comité
international de la Croix Rouge (International Committee of the Red Cross)',
'born': '0000-00-00', 'died': '0000-00-00', 'gender': 'org', 'prizes': [{'year':
'1917', 'category': 'peace', 'share': '1', 'affiliations': [[]]}, {'year':
'1944', 'category': 'peace', 'share': '1', 'affiliations': [[]]}, {'year':
'1963', 'category': 'peace', 'share': '2', 'affiliations': [[]]}]}
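To see why the filter {"prizes.2": {"$exists": True}} selects laureates with at least three prizes, here is a plain-Python sketch of the idea: the path "prizes.2" exists only if the prizes array has an element at index 2. The helper path_exists and the two shortened documents are hypothetical, and the helper is a simplification; MongoDB's own dot notation also handles cases this sketch does not, such as a field name applied across every element of an array.

```python
def path_exists(doc, dotted_path):
    """Return True if every step of a dotted path resolves inside doc."""
    current = doc
    for step in dotted_path.split("."):
        if isinstance(current, list) and step.isdigit():
            # A numeric step indexes into an array
            index = int(step)
            if index >= len(current):
                return False
            current = current[index]
        elif isinstance(current, dict) and step in current:
            # A name step looks up a key in a (sub)document
            current = current[step]
        else:
            return False
    return True

# Hypothetical, shortened documents
one_prize = {"surname": "A", "prizes": [{"year": "1921"}]}
three_prizes = {
    "surname": "B",
    "prizes": [{"year": "1917"}, {"year": "1944"}, {"year": "1963"}],
}

print(path_exists(one_prize, "prizes.2"))
print(path_exists(three_prizes, "prizes.2"))
```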
[5]: # All the values (countries) for the "diedCountry" field. Convert the result to a set
died_countries = set(laureates.distinct("diedCountry"))
# All the values (countries) for the "bornCountry" field. Convert the result to a set
born_countries = set(laureates.distinct("bornCountry"))
# Countries recorded as countries of death but not as countries of birth
print(died_countries - born_countries)
# Assumed filter (its definition is not shown in the source): prizes with at least three laureates
criteria = {"laureates.2": {"$exists": True}}
# Save the set of distinct prize categories in documents satisfying the criteria
# i.e., returns all prize categories shared by three or more laureates.
triple_play_categories = set(prizes.distinct("category", criteria))
# Find the prize categories that are not shared by three or more laureates
print(set(prizes.distinct("category")) - triple_play_categories)
# Assumed filter (its definition is not shown in the source): organization laureates with prizes won before 1945
before = {
    "gender": "org",
    "prizes.year": {"$lt": "1945"},
}
# Save a filter for organization laureates with prizes won in or after 1945
in_or_after = {
    "gender": "org",
    "prizes.year": {"$gte": "1945"},
}
n_before = laureates.count_documents(before)
n_in_or_after = laureates.count_documents(in_or_after)
ratio = n_in_or_after / (n_in_or_after + n_before)
print("Ratio: ", ratio)
{'Greece', 'Israel', 'East Germany', 'Gabon', 'Tunisia', 'USSR', 'Jamaica',
'Northern Rhodesia (now Zambia)', 'Barbados', 'Puerto Rico', 'Yugoslavia (now
Serbia)', 'Czechoslovakia', 'Philippines'}
29
{'literature'}
Ratio: 1.3653846153846154
Ratio: 0.84
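Note that the years in this dataset are stored as strings, so a filter such as {"$gte": "1945"} compares strings rather than numbers. The plain-Python sketch below shows why that still gives the expected result for four-digit years, assuming the comparison is character by character (as Python's string comparison is). The value "999" is hypothetical and does not occur in the data.

```python
years = ["1901", "1944", "1945", "2018"]

# String comparison is lexicographic; for digit strings of equal length it
# agrees with numeric order, so comparing against "1945" works as expected.
in_or_after = [y for y in years if y >= "1945"]
print(in_or_after)

# The agreement breaks when lengths differ: as strings, "999" sorts after
# "1945" because "9" > "1", although 999 < 1945 as numbers.
print("999" >= "1945")
```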
from bson.regex import Regex

# Fill in a string value to be sandwiched between the strings "^Germany " and "now"
# Fill in a string value to be sandwiched between the strings "now" and "$"
criteria = {"bornCountry": Regex("now" + r" Germany\)" + "$")}
print(set(laureates.distinct("bornCountry", criteria)))
# Save the field names corresponding to a laureate's first name and last name
first, last = "firstname", "surname"
print([(laureate[first], laureate[last]) for laureate in laureates.find(criteria)])
{'East Friesland (now Germany)', 'Germany (now France)', 'Germany (now Poland)',
'Germany (now Russia)', 'West Germany (now Germany)', 'Germany', 'Hesse-Kassel
(now Germany)', 'Württemberg (now Germany)', 'Mecklenburg (now Germany)',
'Schleswig (now Germany)', 'Prussia (now Germany)', 'Bavaria (now Germany)'}
{'Germany (now Russia)', 'Germany', 'Germany (now France)', 'Germany (now
Poland)'}
{'Germany (now Russia)', 'Germany (now France)', 'Germany (now Poland)'}
{'East Friesland (now Germany)', 'West Germany (now Germany)', 'Hesse-Kassel
(now Germany)', 'Württemberg (now Germany)', 'Mecklenburg (now Germany)',
'Schleswig (now Germany)', 'Prussia (now Germany)', 'Bavaria (now Germany)'}
[('William Bradford', 'Shockley'), ('John', 'Bardeen'), ('Walter Houser',
'Brattain')]
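The two anchored patterns can be tried out locally with Python's re module, without a database. This is only a sketch: it uses a handful of bornCountry values copied from the output above, and it assumes that these simple patterns behave the same way in Python as they do on the MongoDB server.

```python
import re

# A few bornCountry values copied from the output above
countries = [
    "Germany",
    "Germany (now Poland)",
    "Prussia (now Germany)",
    "West Germany (now Germany)",
]

# "^" anchors the match at the start, "$" at the end; the parentheses are
# escaped because they are literal characters in the country names.
starts_with_germany_now = re.compile(r"^Germany \(now")
ends_with_now_germany = re.compile(r"now Germany\)$")

born_in_germany_now_elsewhere = [c for c in countries if starts_with_germany_now.search(c)]
born_elsewhere_now_germany = [c for c in countries if ends_with_now_germany.search(c)]

print(born_in_germany_now_elsewhere)
print(born_elsewhere_now_germany)
```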
[5]: # Find laureates whose first name starts with "G" and last name starts with "S"
# Use projection to select only firstname and surname
docs = laureates.find(
filter= {"firstname" : {"$regex" : "^G"},
"surname" : {"$regex" : "^S"} },
projection= ["firstname", "surname"] )
5.2 Sorting
We pass a “sort” argument to the find() method, giving a list of field-direction pairs. The list
can contain multiple entries, so you can sort first by one field and then by other fields,
i.e. primary and secondary sorting. The direction is one of:
• ascending: 1
• descending: -1
As an alternative to passing extra parameters to the find() method, we can chain the find() method
and the sort() method, which takes one parameter for “fieldname” and one parameter for “direction”
(ascending is the default direction).
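To make primary and secondary sorting concrete, here is a plain-Python sketch that applies a list of field-direction pairs to a few documents. The helper sort_docs and the shortened documents are hypothetical; this mimics the ordering that a sort such as [("year", -1), ("category", 1)] asks for, and is not how MongoDB implements it.

```python
# Hypothetical, shortened prize documents
docs = [
    {"year": "2016", "category": "physics"},
    {"year": "2017", "category": "physics"},
    {"year": "2016", "category": "chemistry"},
    {"year": "2017", "category": "peace"},
]

def sort_docs(documents, field_directions):
    """Sort by a list of (field, direction) pairs; 1 ascending, -1 descending."""
    result = list(documents)
    # Python's sort is stable, so applying the keys from last to first makes
    # the first pair the primary key and later pairs the tie-breakers.
    for field, direction in reversed(field_directions):
        result.sort(key=lambda d: d[field], reverse=(direction == -1))
    return result

sorted_docs = sort_docs(docs, [("year", -1), ("category", 1)])
for doc in sorted_docs:
    print(doc)
```

The result lists the most recent year first and, within each year, the categories in alphabetical order.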
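from operator import itemgetter
# This function takes a prize document as an argument, extracts all the laureates from
# it, and returns their surnames joined by " and ". The name all_laureates and the
# sorting step are assumptions; the source shows only part of the body.
def all_laureates(prize):
    sorted_laureates = sorted(prize["laureates"], key=itemgetter("surname"))
    # extract surnames
    surnames = [laureate["surname"] for laureate in sorted_laureates]
    all_names = " and ".join(surnames)
    return all_names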
# find physics prizes, project year and name, and sort by year
docs = prizes.find(
filter= {"category": "physics"},
projection= ["year", "laureates.firstname", "laureates.surname"],
sort= [("year", 1), ("laureates.surname", 1)])
1901: Röntgen
1902: Lorentz and Zeeman
1903: Becquerel and Curie and Curie, née Sklodowska
1904: (John William Strutt)
1905: von Lenard
1906: Thomson
1907: Michelson
1908: Lippmann
1909: Braun and Marconi
1910: van der Waals
1911: Wien
1912: Dalén
1913: Kamerlingh Onnes
1914: von Laue
1915: Bragg and Bragg
1917: Barkla
1918: Planck
1919: Stark
1920: Guillaume
1921: Einstein
1922: Bohr
1923: Millikan
1924: Siegbahn
1925: Franck and Hertz
1926: Perrin
1927: Compton and Wilson
1928: Richardson
1929: de Broglie
1930: Raman
1932: Heisenberg
1933: Dirac and Schrödinger
1935: Chadwick
1936: Anderson and Hess
1937: Davisson and Thomson
1938: Fermi
1939: Lawrence
1943: Stern
1944: Rabi
1945: Pauli
1946: Bridgman
1947: Appleton
1948: Blackett
1949: Yukawa
1950: Powell
1951: Cockcroft and Walton
1952: Bloch and Purcell
1953: Zernike
1954: Born and Bothe
1955: Kusch and Lamb
1956: Bardeen and Brattain and Shockley
1957: Lee and Yang
1958: Cherenkov and Frank and Tamm
1959: Chamberlain and Segrè
1960: Glaser
1961: Hofstadter and Mössbauer
1962: Landau
1963: Goeppert Mayer and Jensen and Wigner
1964: Basov and Prokhorov and Townes
1965: Feynman and Schwinger and Tomonaga
1966: Kastler
1967: Bethe
1968: Alvarez
1969: Gell-Mann
1970: Alfvén and Néel
1971: Gabor
1972: Bardeen and Cooper and Schrieffer
1973: Esaki and Giaever and Josephson
1974: Hewish and Ryle
1975: Bohr and Mottelson and Rainwater
1976: Richter and Ting
1977: Anderson and Mott and van Vleck
1978: Kapitsa and Penzias and Wilson
1979: Glashow and Salam and Weinberg
1980: Cronin and Fitch
1981: Bloembergen and Schawlow and Siegbahn
1982: Wilson
1983: Chandrasekhar and Fowler
1984: Rubbia and van der Meer
1985: von Klitzing
1986: Binnig and Rohrer and Ruska
1987: Bednorz and Müller
1988: Lederman and Schwartz and Steinberger
1989: Dehmelt and Paul and Ramsey
1990: Friedman and Kendall and Taylor
1991: de Gennes
1992: Charpak
1993: Hulse and Taylor Jr.
1994: Brockhouse and Shull
1995: Perl and Reines
1996: Lee and Osheroff and Richardson
1997: Chu and Cohen-Tannoudji and Phillips
1998: Laughlin and Störmer and Tsui
1999: 't Hooft and Veltman
2000: Alferov and Kilby and Kroemer
2001: Cornell and Ketterle and Wieman
2002: Davis Jr. and Giacconi and Koshiba
2003: Abrikosov and Ginzburg and Leggett
2004: Gross and Politzer and Wilczek
2005: Glauber and Hall and Hänsch
2006: Mather and Smoot
2007: Fert and Grünberg
2008: Kobayashi and Maskawa and Nambu
2009: Boyle and Kao and Smith
2010: Geim and Novoselov
2011: Perlmutter and Riess and Schmidt
2012: Haroche and Wineland
2013: Englert and Higgs
2014: Akasaki and Amano and Nakamura
2015: Kajita and McDonald
2016: Haldane and Kosterlitz and Thouless
2017: Barish and Thorne and Weiss
2018: Ashkin and Mourou and Strickland
[18]: # Utilize sorting by multiple fields to see which prize categories are missing in which years.
{'year': '2017', 'category': 'peace'}
{'year': '2017', 'category': 'physics'}
{'year': '2016', 'category': 'chemistry'}
{'year': '2016', 'category': 'economics'}
{'year': '2016', 'category': 'literature'}
{'year': '2016', 'category': 'medicine'}
{'year': '2016', 'category': 'peace'}
{'year': '2016', 'category': 'physics'}
{'year': '2015', 'category': 'chemistry'}
{'year': '2015', 'category': 'economics'}
{'year': '2015', 'category': 'literature'}
{'year': '2015', 'category': 'medicine'}
{'year': '2015', 'category': 'peace'}
{'year': '2015', 'category': 'physics'}
{'year': '2014', 'category': 'chemistry'}
{'year': '2014', 'category': 'economics'}
{'year': '2014', 'category': 'literature'}
{'year': '2014', 'category': 'medicine'}
{'year': '2014', 'category': 'peace'}
{'year': '2014', 'category': 'physics'}
{'year': '2013', 'category': 'chemistry'}
{'year': '2013', 'category': 'economics'}
{'year': '2013', 'category': 'literature'}
{'year': '2013', 'category': 'medicine'}
{'year': '2013', 'category': 'peace'}
{'year': '2013', 'category': 'physics'}
{'year': '2012', 'category': 'chemistry'}
{'year': '2012', 'category': 'economics'}
{'year': '2012', 'category': 'literature'}
{'year': '2012', 'category': 'medicine'}
{'year': '2012', 'category': 'peace'}
{'year': '2012', 'category': 'physics'}
{'year': '2011', 'category': 'chemistry'}
{'year': '2011', 'category': 'economics'}
{'year': '2011', 'category': 'literature'}
{'year': '2011', 'category': 'medicine'}
{'year': '2011', 'category': 'peace'}
{'year': '2011', 'category': 'physics'}
{'year': '2010', 'category': 'chemistry'}
{'year': '2010', 'category': 'economics'}
{'year': '2010', 'category': 'literature'}
{'year': '2010', 'category': 'medicine'}
{'year': '2010', 'category': 'peace'}
{'year': '2010', 'category': 'physics'}
{'year': '2009', 'category': 'chemistry'}
{'year': '2009', 'category': 'economics'}
{'year': '2009', 'category': 'literature'}
{'year': '2009', 'category': 'medicine'}
{'year': '2009', 'category': 'peace'}
{'year': '2009', 'category': 'physics'}
{'year': '2008', 'category': 'chemistry'}
{'year': '2008', 'category': 'economics'}
{'year': '2008', 'category': 'literature'}
{'year': '2008', 'category': 'medicine'}
{'year': '2008', 'category': 'peace'}
{'year': '2008', 'category': 'physics'}
{'year': '2007', 'category': 'chemistry'}
{'year': '2007', 'category': 'economics'}
{'year': '2007', 'category': 'literature'}
{'year': '2007', 'category': 'medicine'}
{'year': '2007', 'category': 'peace'}
{'year': '2007', 'category': 'physics'}
{'year': '2006', 'category': 'chemistry'}
{'year': '2006', 'category': 'economics'}
{'year': '2006', 'category': 'literature'}
{'year': '2006', 'category': 'medicine'}
{'year': '2006', 'category': 'peace'}
{'year': '2006', 'category': 'physics'}
{'year': '2005', 'category': 'chemistry'}
{'year': '2005', 'category': 'economics'}
{'year': '2005', 'category': 'literature'}
{'year': '2005', 'category': 'medicine'}
{'year': '2005', 'category': 'peace'}
{'year': '2005', 'category': 'physics'}
{'year': '2004', 'category': 'chemistry'}
{'year': '2004', 'category': 'economics'}
{'year': '2004', 'category': 'literature'}
{'year': '2004', 'category': 'medicine'}
{'year': '2004', 'category': 'peace'}
{'year': '2004', 'category': 'physics'}
{'year': '2003', 'category': 'chemistry'}
{'year': '2003', 'category': 'economics'}
{'year': '2003', 'category': 'literature'}
{'year': '2003', 'category': 'medicine'}
{'year': '2003', 'category': 'peace'}
{'year': '2003', 'category': 'physics'}
{'year': '2002', 'category': 'chemistry'}
{'year': '2002', 'category': 'economics'}
{'year': '2002', 'category': 'literature'}
{'year': '2002', 'category': 'medicine'}
{'year': '2002', 'category': 'peace'}
{'year': '2002', 'category': 'physics'}
{'year': '2001', 'category': 'chemistry'}
{'year': '2001', 'category': 'economics'}
{'year': '2001', 'category': 'literature'}
{'year': '2001', 'category': 'medicine'}
{'year': '2001', 'category': 'peace'}
{'year': '2001', 'category': 'physics'}
{'year': '2000', 'category': 'chemistry'}
{'year': '2000', 'category': 'economics'}
{'year': '2000', 'category': 'literature'}
{'year': '2000', 'category': 'medicine'}
{'year': '2000', 'category': 'peace'}
{'year': '2000', 'category': 'physics'}
{'year': '1999', 'category': 'chemistry'}
{'year': '1999', 'category': 'economics'}
{'year': '1999', 'category': 'literature'}
{'year': '1999', 'category': 'medicine'}
{'year': '1999', 'category': 'peace'}
{'year': '1999', 'category': 'physics'}
{'year': '1998', 'category': 'chemistry'}
{'year': '1998', 'category': 'economics'}
{'year': '1998', 'category': 'literature'}
{'year': '1998', 'category': 'medicine'}
{'year': '1998', 'category': 'peace'}
{'year': '1998', 'category': 'physics'}
{'year': '1997', 'category': 'chemistry'}
{'year': '1997', 'category': 'economics'}
{'year': '1997', 'category': 'literature'}
{'year': '1997', 'category': 'medicine'}
{'year': '1997', 'category': 'peace'}
{'year': '1997', 'category': 'physics'}
{'year': '1996', 'category': 'chemistry'}
{'year': '1996', 'category': 'economics'}
{'year': '1996', 'category': 'literature'}
{'year': '1996', 'category': 'medicine'}
{'year': '1996', 'category': 'peace'}
{'year': '1996', 'category': 'physics'}
{'year': '1995', 'category': 'chemistry'}
{'year': '1995', 'category': 'economics'}
{'year': '1995', 'category': 'literature'}
{'year': '1995', 'category': 'medicine'}
{'year': '1995', 'category': 'peace'}
{'year': '1995', 'category': 'physics'}
{'year': '1994', 'category': 'chemistry'}
{'year': '1994', 'category': 'economics'}
{'year': '1994', 'category': 'literature'}
{'year': '1994', 'category': 'medicine'}
{'year': '1994', 'category': 'peace'}
{'year': '1994', 'category': 'physics'}
{'year': '1993', 'category': 'chemistry'}
{'year': '1993', 'category': 'economics'}
{'year': '1993', 'category': 'literature'}
{'year': '1993', 'category': 'medicine'}
{'year': '1993', 'category': 'peace'}
{'year': '1993', 'category': 'physics'}
{'year': '1992', 'category': 'chemistry'}
{'year': '1992', 'category': 'economics'}
{'year': '1992', 'category': 'literature'}
{'year': '1992', 'category': 'medicine'}
{'year': '1992', 'category': 'peace'}
{'year': '1992', 'category': 'physics'}
{'year': '1991', 'category': 'chemistry'}
{'year': '1991', 'category': 'economics'}
{'year': '1991', 'category': 'literature'}
{'year': '1991', 'category': 'medicine'}
{'year': '1991', 'category': 'peace'}
{'year': '1991', 'category': 'physics'}
{'year': '1990', 'category': 'chemistry'}
{'year': '1990', 'category': 'economics'}
{'year': '1990', 'category': 'literature'}
{'year': '1990', 'category': 'medicine'}
{'year': '1990', 'category': 'peace'}
{'year': '1990', 'category': 'physics'}
{'year': '1989', 'category': 'chemistry'}
{'year': '1989', 'category': 'economics'}
{'year': '1989', 'category': 'literature'}
{'year': '1989', 'category': 'medicine'}
{'year': '1989', 'category': 'peace'}
{'year': '1989', 'category': 'physics'}
{'year': '1988', 'category': 'chemistry'}
{'year': '1988', 'category': 'economics'}
{'year': '1988', 'category': 'literature'}
{'year': '1988', 'category': 'medicine'}
{'year': '1988', 'category': 'peace'}
{'year': '1988', 'category': 'physics'}
{'year': '1987', 'category': 'chemistry'}
{'year': '1987', 'category': 'economics'}
{'year': '1987', 'category': 'literature'}
{'year': '1987', 'category': 'medicine'}
{'year': '1987', 'category': 'peace'}
{'year': '1987', 'category': 'physics'}
{'year': '1986', 'category': 'chemistry'}
{'year': '1986', 'category': 'economics'}
{'year': '1986', 'category': 'literature'}
{'year': '1986', 'category': 'medicine'}
{'year': '1986', 'category': 'peace'}
{'year': '1986', 'category': 'physics'}
{'year': '1985', 'category': 'chemistry'}
{'year': '1985', 'category': 'economics'}
{'year': '1985', 'category': 'literature'}
{'year': '1985', 'category': 'medicine'}
{'year': '1985', 'category': 'peace'}
{'year': '1985', 'category': 'physics'}
{'year': '1984', 'category': 'chemistry'}
{'year': '1984', 'category': 'economics'}
{'year': '1984', 'category': 'literature'}
{'year': '1984', 'category': 'medicine'}
{'year': '1984', 'category': 'peace'}
{'year': '1984', 'category': 'physics'}
{'year': '1983', 'category': 'chemistry'}
{'year': '1983', 'category': 'economics'}
{'year': '1983', 'category': 'literature'}
{'year': '1983', 'category': 'medicine'}
{'year': '1983', 'category': 'peace'}
{'year': '1983', 'category': 'physics'}
{'year': '1982', 'category': 'chemistry'}
{'year': '1982', 'category': 'economics'}
{'year': '1982', 'category': 'literature'}
{'year': '1982', 'category': 'medicine'}
{'year': '1982', 'category': 'peace'}
{'year': '1982', 'category': 'physics'}
{'year': '1981', 'category': 'chemistry'}
{'year': '1981', 'category': 'economics'}
{'year': '1981', 'category': 'literature'}
{'year': '1981', 'category': 'medicine'}
{'year': '1981', 'category': 'peace'}
{'year': '1981', 'category': 'physics'}
{'year': '1980', 'category': 'chemistry'}
{'year': '1980', 'category': 'economics'}
{'year': '1980', 'category': 'literature'}
{'year': '1980', 'category': 'medicine'}
{'year': '1980', 'category': 'peace'}
{'year': '1980', 'category': 'physics'}
{'year': '1979', 'category': 'chemistry'}
{'year': '1979', 'category': 'economics'}
{'year': '1979', 'category': 'literature'}
{'year': '1979', 'category': 'medicine'}
{'year': '1979', 'category': 'peace'}
{'year': '1979', 'category': 'physics'}
{'year': '1978', 'category': 'chemistry'}
{'year': '1978', 'category': 'economics'}
{'year': '1978', 'category': 'literature'}
{'year': '1978', 'category': 'medicine'}
{'year': '1978', 'category': 'peace'}
{'year': '1978', 'category': 'physics'}
{'year': '1977', 'category': 'chemistry'}
{'year': '1977', 'category': 'economics'}
{'year': '1977', 'category': 'literature'}
{'year': '1977', 'category': 'medicine'}
{'year': '1977', 'category': 'peace'}
{'year': '1977', 'category': 'physics'}
{'year': '1976', 'category': 'chemistry'}
{'year': '1976', 'category': 'economics'}
{'year': '1976', 'category': 'literature'}
{'year': '1976', 'category': 'medicine'}
{'year': '1976', 'category': 'peace'}
{'year': '1976', 'category': 'physics'}
{'year': '1975', 'category': 'chemistry'}
{'year': '1975', 'category': 'economics'}
{'year': '1975', 'category': 'literature'}
{'year': '1975', 'category': 'medicine'}
{'year': '1975', 'category': 'peace'}
{'year': '1975', 'category': 'physics'}
{'year': '1974', 'category': 'chemistry'}
{'year': '1974', 'category': 'economics'}
{'year': '1974', 'category': 'literature'}
{'year': '1974', 'category': 'medicine'}
{'year': '1974', 'category': 'peace'}
{'year': '1974', 'category': 'physics'}
{'year': '1973', 'category': 'chemistry'}
{'year': '1973', 'category': 'economics'}
{'year': '1973', 'category': 'literature'}
{'year': '1973', 'category': 'medicine'}
{'year': '1973', 'category': 'peace'}
{'year': '1973', 'category': 'physics'}
{'year': '1972', 'category': 'chemistry'}
{'year': '1972', 'category': 'economics'}
{'year': '1972', 'category': 'literature'}
{'year': '1972', 'category': 'medicine'}
{'year': '1972', 'category': 'physics'}
{'year': '1971', 'category': 'chemistry'}
{'year': '1971', 'category': 'economics'}
{'year': '1971', 'category': 'literature'}
{'year': '1971', 'category': 'medicine'}
{'year': '1971', 'category': 'peace'}
{'year': '1971', 'category': 'physics'}
{'year': '1970', 'category': 'chemistry'}
{'year': '1970', 'category': 'economics'}
{'year': '1970', 'category': 'literature'}
{'year': '1970', 'category': 'medicine'}
{'year': '1970', 'category': 'peace'}
{'year': '1970', 'category': 'physics'}
{'year': '1969', 'category': 'chemistry'}
{'year': '1969', 'category': 'economics'}
{'year': '1969', 'category': 'literature'}
{'year': '1969', 'category': 'medicine'}
{'year': '1969', 'category': 'peace'}
{'year': '1969', 'category': 'physics'}
{'year': '1968', 'category': 'chemistry'}
{'year': '1968', 'category': 'literature'}
{'year': '1968', 'category': 'medicine'}
{'year': '1968', 'category': 'peace'}
{'year': '1968', 'category': 'physics'}
{'year': '1967', 'category': 'chemistry'}
{'year': '1967', 'category': 'literature'}
{'year': '1967', 'category': 'medicine'}
{'year': '1967', 'category': 'physics'}
{'year': '1966', 'category': 'chemistry'}
{'year': '1966', 'category': 'literature'}
{'year': '1966', 'category': 'medicine'}
{'year': '1966', 'category': 'physics'}
{'year': '1965', 'category': 'chemistry'}
{'year': '1965', 'category': 'literature'}
{'year': '1965', 'category': 'medicine'}
{'year': '1965', 'category': 'peace'}
{'year': '1965', 'category': 'physics'}
{'year': '1964', 'category': 'chemistry'}
{'year': '1964', 'category': 'literature'}
{'year': '1964', 'category': 'medicine'}
{'year': '1964', 'category': 'peace'}
{'year': '1964', 'category': 'physics'}
{'year': '1963', 'category': 'chemistry'}
{'year': '1963', 'category': 'literature'}
{'year': '1963', 'category': 'medicine'}
{'year': '1963', 'category': 'peace'}
{'year': '1963', 'category': 'physics'}
{'year': '1962', 'category': 'chemistry'}
{'year': '1962', 'category': 'literature'}
{'year': '1962', 'category': 'medicine'}
{'year': '1962', 'category': 'peace'}
{'year': '1962', 'category': 'physics'}
{'year': '1961', 'category': 'chemistry'}
{'year': '1961', 'category': 'literature'}
{'year': '1961', 'category': 'medicine'}
{'year': '1961', 'category': 'peace'}
{'year': '1961', 'category': 'physics'}
{'year': '1960', 'category': 'chemistry'}
{'year': '1960', 'category': 'literature'}
{'year': '1960', 'category': 'medicine'}
{'year': '1960', 'category': 'peace'}
{'year': '1960', 'category': 'physics'}
{'year': '1959', 'category': 'chemistry'}
{'year': '1959', 'category': 'literature'}
{'year': '1959', 'category': 'medicine'}
{'year': '1959', 'category': 'peace'}
{'year': '1959', 'category': 'physics'}
{'year': '1958', 'category': 'chemistry'}
{'year': '1958', 'category': 'literature'}
{'year': '1958', 'category': 'medicine'}
{'year': '1958', 'category': 'peace'}
{'year': '1958', 'category': 'physics'}
{'year': '1957', 'category': 'chemistry'}
{'year': '1957', 'category': 'literature'}
{'year': '1957', 'category': 'medicine'}
{'year': '1957', 'category': 'peace'}
{'year': '1957', 'category': 'physics'}
{'year': '1956', 'category': 'chemistry'}
{'year': '1956', 'category': 'literature'}
{'year': '1956', 'category': 'medicine'}
{'year': '1956', 'category': 'physics'}
{'year': '1955', 'category': 'chemistry'}
{'year': '1955', 'category': 'literature'}
{'year': '1955', 'category': 'medicine'}
{'year': '1955', 'category': 'physics'}
{'year': '1954', 'category': 'chemistry'}
{'year': '1954', 'category': 'literature'}
{'year': '1954', 'category': 'medicine'}
{'year': '1954', 'category': 'peace'}
{'year': '1954', 'category': 'physics'}
{'year': '1953', 'category': 'chemistry'}
{'year': '1953', 'category': 'literature'}
{'year': '1953', 'category': 'medicine'}
{'year': '1953', 'category': 'peace'}
{'year': '1953', 'category': 'physics'}
{'year': '1952', 'category': 'chemistry'}
{'year': '1952', 'category': 'literature'}
{'year': '1952', 'category': 'medicine'}
{'year': '1952', 'category': 'peace'}
{'year': '1952', 'category': 'physics'}
{'year': '1951', 'category': 'chemistry'}
{'year': '1951', 'category': 'literature'}
{'year': '1951', 'category': 'medicine'}
{'year': '1951', 'category': 'peace'}
{'year': '1951', 'category': 'physics'}
{'year': '1950', 'category': 'chemistry'}
{'year': '1950', 'category': 'literature'}
{'year': '1950', 'category': 'medicine'}
{'year': '1950', 'category': 'peace'}
{'year': '1950', 'category': 'physics'}
{'year': '1949', 'category': 'chemistry'}
{'year': '1949', 'category': 'literature'}
{'year': '1949', 'category': 'medicine'}
{'year': '1949', 'category': 'peace'}
{'year': '1949', 'category': 'physics'}
{'year': '1948', 'category': 'chemistry'}
{'year': '1948', 'category': 'literature'}
{'year': '1948', 'category': 'medicine'}
{'year': '1948', 'category': 'physics'}
{'year': '1947', 'category': 'chemistry'}
{'year': '1947', 'category': 'literature'}
{'year': '1947', 'category': 'medicine'}
{'year': '1947', 'category': 'peace'}
{'year': '1947', 'category': 'physics'}
{'year': '1946', 'category': 'chemistry'}
{'year': '1946', 'category': 'literature'}
{'year': '1946', 'category': 'medicine'}
{'year': '1946', 'category': 'peace'}
{'year': '1946', 'category': 'physics'}
{'year': '1945', 'category': 'chemistry'}
{'year': '1945', 'category': 'literature'}
{'year': '1945', 'category': 'medicine'}
{'year': '1945', 'category': 'peace'}
{'year': '1945', 'category': 'physics'}
{'year': '1944', 'category': 'chemistry'}
{'year': '1944', 'category': 'literature'}
{'year': '1944', 'category': 'medicine'}
{'year': '1944', 'category': 'peace'}
{'year': '1944', 'category': 'physics'}
{'year': '1943', 'category': 'chemistry'}
{'year': '1943', 'category': 'medicine'}
{'year': '1943', 'category': 'physics'}
{'year': '1939', 'category': 'chemistry'}
{'year': '1939', 'category': 'literature'}
{'year': '1939', 'category': 'medicine'}
{'year': '1939', 'category': 'physics'}
{'year': '1938', 'category': 'chemistry'}
{'year': '1938', 'category': 'literature'}
{'year': '1938', 'category': 'medicine'}
{'year': '1938', 'category': 'peace'}
{'year': '1938', 'category': 'physics'}
{'year': '1937', 'category': 'chemistry'}
{'year': '1937', 'category': 'literature'}
{'year': '1937', 'category': 'medicine'}
{'year': '1937', 'category': 'peace'}
{'year': '1937', 'category': 'physics'}
{'year': '1936', 'category': 'chemistry'}
{'year': '1936', 'category': 'literature'}
{'year': '1936', 'category': 'medicine'}
{'year': '1936', 'category': 'peace'}
{'year': '1936', 'category': 'physics'}
{'year': '1935', 'category': 'chemistry'}
{'year': '1935', 'category': 'medicine'}
{'year': '1935', 'category': 'peace'}
{'year': '1935', 'category': 'physics'}
{'year': '1934', 'category': 'chemistry'}
{'year': '1934', 'category': 'literature'}
{'year': '1934', 'category': 'medicine'}
{'year': '1934', 'category': 'peace'}
{'year': '1933', 'category': 'literature'}
{'year': '1933', 'category': 'medicine'}
{'year': '1933', 'category': 'peace'}
{'year': '1933', 'category': 'physics'}
{'year': '1932', 'category': 'chemistry'}
{'year': '1932', 'category': 'literature'}
{'year': '1932', 'category': 'medicine'}
{'year': '1932', 'category': 'physics'}
{'year': '1931', 'category': 'chemistry'}
{'year': '1931', 'category': 'literature'}
{'year': '1931', 'category': 'medicine'}
{'year': '1931', 'category': 'peace'}
{'year': '1930', 'category': 'chemistry'}
{'year': '1930', 'category': 'literature'}
{'year': '1930', 'category': 'medicine'}
{'year': '1930', 'category': 'peace'}
{'year': '1930', 'category': 'physics'}
{'year': '1929', 'category': 'chemistry'}
{'year': '1929', 'category': 'literature'}
{'year': '1929', 'category': 'medicine'}
{'year': '1929', 'category': 'peace'}
{'year': '1929', 'category': 'physics'}
{'year': '1928', 'category': 'chemistry'}
{'year': '1928', 'category': 'literature'}
{'year': '1928', 'category': 'medicine'}
{'year': '1928', 'category': 'physics'}
{'year': '1927', 'category': 'chemistry'}
{'year': '1927', 'category': 'literature'}
{'year': '1927', 'category': 'medicine'}
{'year': '1927', 'category': 'peace'}
{'year': '1927', 'category': 'physics'}
{'year': '1926', 'category': 'chemistry'}
{'year': '1926', 'category': 'literature'}
{'year': '1926', 'category': 'medicine'}
{'year': '1926', 'category': 'peace'}
{'year': '1926', 'category': 'physics'}
{'year': '1925', 'category': 'chemistry'}
{'year': '1925', 'category': 'literature'}
{'year': '1925', 'category': 'peace'}
{'year': '1925', 'category': 'physics'}
{'year': '1924', 'category': 'literature'}
{'year': '1924', 'category': 'medicine'}
{'year': '1924', 'category': 'physics'}
{'year': '1923', 'category': 'chemistry'}
{'year': '1923', 'category': 'literature'}
{'year': '1923', 'category': 'medicine'}
{'year': '1923', 'category': 'physics'}
{'year': '1922', 'category': 'chemistry'}
{'year': '1922', 'category': 'literature'}
{'year': '1922', 'category': 'medicine'}
{'year': '1922', 'category': 'peace'}
{'year': '1922', 'category': 'physics'}
{'year': '1921', 'category': 'chemistry'}
{'year': '1921', 'category': 'literature'}
{'year': '1921', 'category': 'peace'}
{'year': '1921', 'category': 'physics'}
{'year': '1920', 'category': 'chemistry'}
{'year': '1920', 'category': 'literature'}
{'year': '1920', 'category': 'medicine'}
{'year': '1920', 'category': 'peace'}
{'year': '1920', 'category': 'physics'}
{'year': '1919', 'category': 'literature'}
{'year': '1919', 'category': 'medicine'}
{'year': '1919', 'category': 'peace'}
{'year': '1919', 'category': 'physics'}
{'year': '1918', 'category': 'chemistry'}
{'year': '1918', 'category': 'physics'}
{'year': '1917', 'category': 'literature'}
{'year': '1917', 'category': 'peace'}
{'year': '1917', 'category': 'physics'}
{'year': '1916', 'category': 'literature'}
{'year': '1915', 'category': 'chemistry'}
{'year': '1915', 'category': 'literature'}
{'year': '1915', 'category': 'physics'}
{'year': '1914', 'category': 'chemistry'}
{'year': '1914', 'category': 'medicine'}
{'year': '1914', 'category': 'physics'}
{'year': '1913', 'category': 'chemistry'}
{'year': '1913', 'category': 'literature'}
{'year': '1913', 'category': 'medicine'}
{'year': '1913', 'category': 'peace'}
{'year': '1913', 'category': 'physics'}
{'year': '1912', 'category': 'chemistry'}
{'year': '1912', 'category': 'literature'}
{'year': '1912', 'category': 'medicine'}
{'year': '1912', 'category': 'peace'}
{'year': '1912', 'category': 'physics'}
{'year': '1911', 'category': 'chemistry'}
{'year': '1911', 'category': 'literature'}
{'year': '1911', 'category': 'medicine'}
{'year': '1911', 'category': 'peace'}
{'year': '1911', 'category': 'physics'}
{'year': '1910', 'category': 'chemistry'}
{'year': '1910', 'category': 'literature'}
{'year': '1910', 'category': 'medicine'}
{'year': '1910', 'category': 'peace'}
{'year': '1910', 'category': 'physics'}
{'year': '1909', 'category': 'chemistry'}
{'year': '1909', 'category': 'literature'}
{'year': '1909', 'category': 'medicine'}
{'year': '1909', 'category': 'peace'}
{'year': '1909', 'category': 'physics'}
{'year': '1908', 'category': 'chemistry'}
{'year': '1908', 'category': 'literature'}
{'year': '1908', 'category': 'medicine'}
{'year': '1908', 'category': 'peace'}
{'year': '1908', 'category': 'physics'}
{'year': '1907', 'category': 'chemistry'}
{'year': '1907', 'category': 'literature'}
{'year': '1907', 'category': 'medicine'}
{'year': '1907', 'category': 'peace'}
{'year': '1907', 'category': 'physics'}
{'year': '1906', 'category': 'chemistry'}
{'year': '1906', 'category': 'literature'}
{'year': '1906', 'category': 'medicine'}
{'year': '1906', 'category': 'peace'}
{'year': '1906', 'category': 'physics'}
{'year': '1905', 'category': 'chemistry'}
{'year': '1905', 'category': 'literature'}
{'year': '1905', 'category': 'medicine'}
{'year': '1905', 'category': 'peace'}
{'year': '1905', 'category': 'physics'}
{'year': '1904', 'category': 'chemistry'}
{'year': '1904', 'category': 'literature'}
{'year': '1904', 'category': 'medicine'}
{'year': '1904', 'category': 'peace'}
{'year': '1904', 'category': 'physics'}
{'year': '1903', 'category': 'chemistry'}
{'year': '1903', 'category': 'literature'}
{'year': '1903', 'category': 'medicine'}
{'year': '1903', 'category': 'peace'}
{'year': '1903', 'category': 'physics'}
{'year': '1902', 'category': 'chemistry'}
{'year': '1902', 'category': 'literature'}
{'year': '1902', 'category': 'medicine'}
{'year': '1902', 'category': 'peace'}
{'year': '1902', 'category': 'physics'}
{'year': '1901', 'category': 'chemistry'}
{'year': '1901', 'category': 'literature'}
{'year': '1901', 'category': 'medicine'}
{'year': '1901', 'category': 'peace'}
{'year': '1901', 'category': 'physics'}
5.3 Indexing
An index in MongoDB is a special data structure that stores the values of selected document fields, together with references to the documents that contain them.
• Indexes speed up searches because MongoDB can look up matching values in the index, which holds only a few fields, instead of scanning every document in the collection.
• On the other hand, too many indexes can slow down insert, update and delete operations, because every write must also update the indexes, and the indexes take up additional storage space.
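The trade-off described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the sample documents and the helper names (build_index, find_by_scan, find_by_index) are hypothetical, and a dictionary stands in for the ordered B-tree structure that MongoDB actually maintains.

```python
from collections import defaultdict

# Hypothetical sample documents, shaped like the prize documents in this notebook.
prizes_sample = [
    {"year": "1903", "category": "physics"},
    {"year": "1903", "category": "chemistry"},
    {"year": "1902", "category": "physics"},
    {"year": "1901", "category": "peace"},
]


def build_index(docs, field):
    """Map each value of `field` to the positions of the documents holding it."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        index[doc[field]].append(position)
    return index


def find_by_scan(docs, field, value):
    """Collection scan: every document is inspected."""
    return [doc for doc in docs if doc[field] == value]


def find_by_index(docs, index, value):
    """Index lookup: jump straight to the matching positions."""
    return [docs[position] for position in index.get(value, [])]


category_index = build_index(prizes_sample, "category")
physics_scan = find_by_scan(prizes_sample, "category", "physics")
physics_indexed = find_by_index(prizes_sample, category_index, "physics")
print(physics_indexed)
```

Both lookups return the same documents, but the indexed one never touches non-matching documents. The cost is that category_index has to be rebuilt or updated whenever a document is inserted, changed or deleted.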
When to use an index
• First, when you expect to get only one or a few documents back. If your typical queries fetch most if not all documents, you might as well scan the whole collection, and making MongoDB maintain an index is a waste of time.
• Second, when you have very large documents or very large collections. Rather than load these into memory from disk, MongoDB can use the much smaller indexes.
Index operations (mongo shell syntax)
• Creating an index
– An index model is a list of (field, direction) pairs, where direction is either 1 (ascending) or -1 (descending).
– db.collection_name.createIndex({field_name: 1 or -1})
• Listing the indexes of a collection
– db.collection_name.getIndexes()
• Dropping indexes
– db.collection_name.dropIndex({field_name: 1}) drops one index, identified by its key specification or by its name
– db.collection_name.dropIndexes() drops all indexes except the one on _id
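The commands listed above use mongo shell syntax. In PyMongo the corresponding Collection methods are create_index(), index_information() and drop_index(). The helper below is a minimal sketch: the function name index_roundtrip is made up for illustration, and it assumes it is given a PyMongo collection such as the prizes collection used in this notebook.

```python
def index_roundtrip(collection):
    """Create, list and drop a compound index on `collection`.

    `collection` is assumed to be a pymongo Collection (for example the
    `prizes` collection used in this notebook).
    """
    # Index model: category ascending, then year descending.
    name = collection.create_index([("category", 1), ("year", -1)])
    # index_information() returns a dict keyed by index name.
    names_while_present = sorted(collection.index_information())
    # drop_index() accepts the name that create_index() returned.
    collection.drop_index(name)
    return name, names_while_present


# Usage with a running server (not executed here):
# name, names = index_roundtrip(prizes)
```

Dropping the index at the end keeps the sketch free of side effects. In real use you would keep the index for as long as the queries that need it are being run.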
[20]: # Example of creating an index that speeds up finding prizes by category and
# then sorting results by decreasing year
# for each distinct prize category, find the latest-year prize (requiring a descending sort by year)
# of that category (so, find matches for that category) with a laureate share of "1".
print(report)
chemistry: 2011
economics: 2017
literature: 2017
medicine: 2016
peace: 2017
physics: 1992
[4]: # Some countries are, for one or more laureates, both their country of birth ("bornCountry") and
# You will find the five countries of birth with the highest counts of such laureates.
laureates.create_index([("bornCountry", 1)])
five_most_common = Counter(n_born_and_affiliated).most_common(5)
print(five_most_common)
5.4 Limits
To limit the number of results in MongoDB, use the limit() method. It takes one parameter, a number defining the maximum number of documents to return.
Besides limiting the number of results, we can also skip results server-side with the skip() method (or the skip parameter of find()). Combining skip with limit gives pagination, with the number of results per page set by the limit.
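As a rough sketch of the pagination arithmetic, the helper below turns a 1-based page number into skip and limit values. The name page_window and the in-memory list of years are assumptions made for illustration; only the skip and limit keywords themselves come from PyMongo's find().

```python
def page_window(page_number, page_size):
    """Return the skip/limit keyword arguments for a 1-based page number."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    return {"skip": (page_number - 1) * page_size, "limit": page_size}


# In-memory stand-in for a sorted result set, to show which rows a page covers.
years = [str(year) for year in range(1901, 1921)]
window = page_window(page_number=3, page_size=5)
page = years[window["skip"]: window["skip"] + window["limit"]]
print(window)  # {'skip': 10, 'limit': 5}
print(page)    # ['1911', '1912', '1913', '1914', '1915']

# With PyMongo the same window could be passed straight to find(), e.g.
# prizes.find({}, sort=[("year", 1)], **window)
```

Pagination only gives stable pages when the results are sorted, which is why the commented find() call includes a sort.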
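[7]: # Find the first five prizes with one or more laureates sharing 1/4 of the prize.
# Save to filter_ the filter document to fetch only prizes with one or more quarter-share laureates,
# (assumption: "share" holds the denominator as a string, so "4" means a quarter share)
filter_ = {'laureates.share': '4'}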
# Save to projection the list of field names so that prize category, year and
# laureates' motivations ("laureates.motivation") may be fetched for inspection.
projection = ['category', 'year', 'laureates.motivation']
# Save to cursor a cursor that will yield prizes, sorted by ascending year.
# Limit this to five prizes, and sort using the most concise specification.
cursor = prizes.find(filter_, projection).sort("year").limit(5)
pprint(list(cursor))
[{'_id': ObjectId('5fcf861e166123acf44a9fc0'),
'category': 'physics',
'laureates': [{'motivation': '"in recognition of the extraordinary services '
'he has rendered by his discovery of '
'spontaneous radioactivity"'},
{'motivation': '"in recognition of the extraordinary services '
'they have rendered by their joint researches '
'on the radiation phenomena discovered by '
'Professor Henri Becquerel"'},
{'motivation': '"in recognition of the extraordinary services '
'they have rendered by their joint researches '
'on the radiation phenomena discovered by '
'Professor Henri Becquerel"'}],
'year': '1903'},
{'_id': ObjectId('5fcf861e166123acf44a9f67'),
'category': 'chemistry',
'laureates': [{'motivation': '"for his discovery that enzymes can be '
'crystallized"'},
{'motivation': '"for their preparation of enzymes and virus '
'proteins in a pure form"'},
{'motivation': '"for their preparation of enzymes and virus '
'proteins in a pure form"'}],
'year': '1946'},
{'_id': ObjectId('5fcf861e166123acf44a9f40'),
'category': 'medicine',
'laureates': [{'motivation': '"for their discovery of the course of the '
'catalytic conversion of glycogen"'},
{'motivation': '"for their discovery of the course of the '
'catalytic conversion of glycogen"'},
{'motivation': '"for his discovery of the part played by the '
'hormone of the anterior pituitary lobe in the '
'metabolism of sugar"'}],
'year': '1947'},
{'_id': ObjectId('5fcf861e166123acf44a9f21'),
'category': 'medicine',
'laureates': [{'motivation': '"for their discovery that genes act by '
'regulating definite chemical events"'},
{'motivation': '"for their discovery that genes act by '
'regulating definite chemical events"'},
{'motivation': '"for his discoveries concerning genetic '
'recombination and the organization of the '
'genetic material of bacteria"'}],
'year': '1958'},
{'_id': ObjectId('5fcf861e166123acf44a9f01'),
'category': 'physics',
'laureates': [{'motivation': '"for his contributions to the theory of the '
'atomic nucleus and the elementary particles, '
'particularly through the discovery and '
'application of fundamental symmetry '
'principles"'},
{'motivation': '"for their discoveries concerning nuclear '
'shell structure"'},
{'motivation': '"for their discoveries concerning nuclear '
'shell structure"'}],
'year': '1963'}]
6 Aggregation pipeline
The aggregation pipeline is a framework for data aggregation modeled on the concept of data
processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into
aggregated results. Aggregation pipelines can be constructed for flexible and powerful analyses.
An aggregation pipeline is an ordered list of stages. Each stage transforms the documents as they pass through, and the output of one stage is the input of the next.
db.collection.aggregate([stage_1, stage_2, …])
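Common pipeline stages include:
• $project – selects and reshapes fields
• $match – filters documents
• $group – groups documents and computes aggregates
• $sort – sorts documents
• $skip – skips a number of documents
• $limit – limits the number of documents
• $unwind – outputs one document per element of an array field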
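To make the stage list concrete, here is a minimal sketch of a pipeline written as a Python list of stage dictionaries. The function name latest_prizes_pipeline is hypothetical, and the field names assume the prize documents used throughout this notebook.

```python
def latest_prizes_pipeline(category, n):
    """Build a pipeline returning the n most recent prizes in one category."""
    return [
        {"$match": {"category": category}},                  # filter documents
        {"$project": {"_id": 0, "year": 1, "category": 1}},  # keep two fields
        {"$sort": {"year": -1}},                             # newest first
        {"$limit": n},                                       # cap the result size
    ]


pipeline = latest_prizes_pipeline("physics", 3)
for stage in pipeline:
    print(stage)

# With a PyMongo collection this would run as: list(prizes.aggregate(pipeline))
```

Stage order matters: putting $match first means the later stages only process the documents that passed the filter.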
# prize documents for all original categories (that is, $in categories awarded in 1901).
# Project only the prize year and category (including document _id is fine).
# The aggregation cursor will be fed to Python's itertools.groupby function to group prizes by year.
# For each year that at least one of the original prize categories was missing,
# a line with all missing categories for that year will be printed.
2018: literature
1972: peace
1967: peace
1966: peace
1956: peace
1955: peace
1948: peace
1943: literature, peace
1939: peace
1935: literature
1934: physics
1933: chemistry
1932: peace
1931: physics
1928: peace
1925: medicine
1924: chemistry, peace
1923: peace
1921: medicine
1919: chemistry
1918: literature, medicine, peace
1917: chemistry, medicine
1916: chemistry, medicine, peace, physics
1915: medicine, peace
1914: literature, peace
[14]: # Fill out pipeline to determine the number of prizes awarded (at least partly) to organizations.
# Then, use a field path to project the number of prizes for each organization
# as the "$size" of the "prizes" array.
# Recall that to specify the value of a field "<my_field>", you use the field path "$<my_field>".
# Finally, use a single group {"_id": None} to sum over values of all organizations' prize counts
pipeline = [
{"$match": {"gender": "org"}},
{"$project": {"n_prizes": {"$size": "$prizes"}}},
{"$group": {"_id": None, "n_prizes_total": {"$sum": "$n_prizes"}}}
]
print(list(laureates.aggregate(pipeline)))
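To see what each stage of the pipeline above computes, the same three steps can be mimicked in plain Python on a handful of documents. The sample data below is hypothetical and contains only the fields the pipeline reads. It is a sketch of the logic, not of how MongoDB executes it.

```python
# Hypothetical laureate documents; only the fields the pipeline reads are shown.
laureates_sample = [
    {"gender": "org", "prizes": [{"year": "1917"}, {"year": "1944"}, {"year": "1963"}]},
    {"gender": "org", "prizes": [{"year": "1954"}, {"year": "1981"}]},
    {"gender": "female", "prizes": [{"year": "1903"}, {"year": "1911"}]},
]

# $match: keep organizations only.
orgs = [doc for doc in laureates_sample if doc["gender"] == "org"]

# $project: n_prizes is the $size of the prizes array.
projected = [{"n_prizes": len(doc["prizes"])} for doc in orgs]

# $group with _id None: a single output document summing over all inputs
# (and no output document at all when nothing reached this stage).
result = (
    [{"_id": None, "n_prizes_total": sum(doc["n_prizes"] for doc in projected)}]
    if projected
    else []
)
print(result)  # [{'_id': None, 'n_prizes_total': 5}]
```

The grouping key None puts every document into the same group, which is why the result is a single document holding the overall total.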
# with the set of categories awarded that year.
{"$group": {"_id": "$year", "categories": {"$addToSet": "$category"}}},
# Project categories *not* awarded (i.e., that are missing this year).
# Given your intermediate collection of year-keyed documents,
# $project a field named "missing" with the (original) categories not awarded that year.
2018: literature
1972: peace
1967: peace
1966: peace
1956: peace
1955: peace
1948: peace
1943: literature, peace
1939: peace
1935: literature
1934: physics
1933: chemistry
1932: peace
1931: physics
1928: peace
1925: medicine
1924: chemistry, peace
1923: peace
1921: medicine
1919: chemistry
1918: literature, medicine, peace
1917: chemistry, medicine
1916: chemistry, medicine, peace, physics
1915: medicine, peace
1914: literature, peace
[18]: # Build an aggregation pipeline to get the count of laureates who either did or did not win a prize
# the prize affiliation country "Germany" should match the country of birth "Prussia (now Germany)".
key_ac = "prizes.affiliations.country"
key_bc = "bornCountry"
pipeline = [
{"$project": {key_bc: 1, key_ac: 1}},
[21]: # Some prize categories have laureates hailing from a greater number of countries than
# do other categories. You will build an aggregation pipeline for the prizes collection to
# collect these numbers, using a $lookup stage to obtain laureate countries of birth.
pipeline = [
# Unwind the laureates array
# $unwind the laureates array field to output one pipeline document for each array element.
{"$unwind": "$laureates"},
{"$lookup": {
"from": "laureates", "foreignField": "id",
"localField": "laureates.id", "as": "laureate_bios"}},