Jupyter Notebook Basics
Beginner’s Tutorial
Published: August 24, 2020
Installation
The easiest way for a beginner to get started with Jupyter Notebooks is by installing Anaconda.
Anaconda is the most widely used Python distribution for data science and comes pre-loaded with all the
most popular libraries and tools.
Some of the biggest Python libraries included in Anaconda are NumPy, pandas, and Matplotlib, though
the full list of 1,000+ packages is extensive.
Anaconda thus lets us hit the ground running with a fully stocked data science workshop without the hassle
of managing countless installations or worrying about dependencies and OS-specific (read: Windows-
specific) installation issues.
To get Anaconda, simply:
1. Download the latest version of Anaconda for Python 3.8.
2. Install Anaconda by following the instructions on the download page and/or in the executable.
If you are a more advanced user with Python already installed and prefer to manage your packages
manually, you can just use pip:
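The command itself isn't shown above; assuming you already have Python 3 and pip available, installing and launching Jupyter the usual way looks like this:

```
# Install the classic Jupyter Notebook (assumes a working Python 3 + pip)
pip3 install jupyter

# Then start the notebook server; this opens the dashboard in your browser
jupyter notebook
```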
Hey presto, here we are! Your first Jupyter Notebook will open in a new tab — each notebook uses its own tab
because you can open multiple notebooks simultaneously.
If you switch back to the dashboard, you will see the new file Untitled.ipynb and you should see some green
text that tells you your notebook is running.
What is an ipynb File?
The short answer: each .ipynb file is one notebook, so each time you create a new notebook, a new .ipynb file
will be created.
The longer answer: Each .ipynb file is a text file that describes the contents of your notebook in a format
called JSON. Each cell and its contents, including image attachments that have been converted into strings
of text, is listed therein along with some metadata.
You can edit this yourself — if you know what you are doing! — by selecting “Edit > Edit Notebook Metadata”
from the menu bar in the notebook. You can also view the contents of your notebook files by selecting “Edit”
from the controls on the dashboard.
However, the key word there is can. In most cases, there’s no reason you should ever need to edit your
notebook metadata manually.
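Since the format is plain JSON, it helps to see what's inside. Here is a heavily simplified sketch of a .ipynb file — real files carry more metadata, such as kernel and language information:

```json
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": ["print('Hello World!')"]
    }
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
```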
The Notebook Interface
Now that you have an open notebook in front of you, its interface will hopefully not look entirely alien. After
all, Jupyter is essentially just an advanced word processor.
Why not take a look around? Check out the menus to get a feel for it, and take a few moments to scroll
down the list of commands in the command palette, which is the small button with the keyboard icon (or Ctrl +
Shift + P).
There are two fairly prominent terms that you should notice, which are probably new to
you: cells and kernels are key both to understanding Jupyter and to what makes it more than just a word
processor. Fortunately, these concepts are not difficult to understand.
A kernel is a “computational engine” that executes the code contained in a notebook document.
A cell is a container for text to be displayed in the notebook or code to be executed by the notebook’s
kernel.
Cells
We’ll return to kernels a little later, but first let’s come to grips with cells. Cells form the body of a notebook.
When you open a new notebook, the box with the green outline is an empty cell.
There are two main cell types that we will cover:
A code cell contains code to be executed in the kernel. When the code is run, the notebook displays the
output below the code cell that generated it.
A Markdown cell contains text formatted using Markdown and displays its output in-place when the
Markdown cell is run.
The first cell in a new notebook is always a code cell.
Let’s test it out with a classic hello world example: Type print('Hello World!') into the cell and click the run
button in the toolbar (or press Ctrl + Enter).
print('Hello World!')
Hello World!
When we run the cell, its output is displayed below and the label to its left will have changed from In [ ] to In
[1].
The output of a code cell also forms part of the document, which is why you can see it in this article. You can
always tell the difference between code and Markdown cells because code cells have that label on the left
and Markdown cells do not.
The “In” part of the label is simply short for “Input,” while the label number indicates when the cell was
executed on the kernel — in this case the cell was executed first.
Run the cell again and the label will change to In [2] because now the cell was the second to be run on the
kernel. It will become clearer why this is so useful later on when we take a closer look at kernels.
From the menu bar, click Insert and select Insert Cell Below to create a new code cell underneath your first
and try out the following code to see what happens. Do you notice anything different?
import time
time.sleep(3)
This cell doesn’t produce any output, but it does take three seconds to execute. Notice how Jupyter signifies
when the cell is currently running by changing its label to In [*].
In general, the output of a cell comes from any text data specifically printed during the cell’s execution, as
well as the value of the last line in the cell, be it a lone variable, a function call, or something else. For
example:
def say_hello(recipient):
    return 'Hello, {}!'.format(recipient)
say_hello('Tim')
'Hello, Tim!'
You’ll find yourself using this almost constantly in your own projects, and we’ll see more of it later on.
Keyboard Shortcuts
One final thing you may have observed when running your cells is that their border turns blue, whereas it
was green while you were editing. In a Jupyter Notebook, there is always one “active” cell highlighted with a
border whose color denotes its current mode:
Green outline — cell is in “edit mode”
Blue outline — cell is in “command mode”
So what can we do to a cell when it’s in command mode? So far, we have seen how to run a cell with Ctrl +
Enter, but there are plenty of other commands we can use. The best way to use them is with keyboard
shortcuts.
Keyboard shortcuts are a very popular aspect of the Jupyter environment because they facilitate a speedy
cell-based workflow. Many of these are actions you can carry out on the active cell when it’s in command
mode.
Below, you’ll find a list of some of Jupyter’s keyboard shortcuts. You don’t need to memorize them all
immediately, but this list should give you a good idea of what’s possible.
Switch into command mode with Esc and back into edit mode with Enter.
Once in command mode:
Scroll up and down your cells with your Up and Down keys.
Press A or B to insert a new cell above or below the active cell.
M will transform the active cell to a Markdown cell.
Y will set the active cell to a code cell.
D + D (D twice) will delete the active cell.
Z will undo cell deletion.
Hold Shift and press Up or Down to select multiple cells at once. With multiple cells selected, Shift +
M will merge your selection.
Ctrl + Shift + -, in edit mode, will split the active cell at the cursor.
You can also click and Shift + Click in the margin to the left of your cells to select them.
Go ahead and try these out in your own notebook. Once you’re ready, create a new Markdown cell and we’ll
learn how to format the text in our notebooks.
Markdown
Markdown is a lightweight, easy-to-learn markup language for formatting plain text. Its syntax has a one-to-
one correspondence with HTML tags, so some prior knowledge here would be helpful but is definitely not a
prerequisite.
Remember that this article was written in a Jupyter notebook, so all of the narrative text and images you
have seen so far were written in Markdown. Let’s cover the basics with a quick example:
This is some plain text that forms a paragraph. Add emphasis via **bold** and __bold__, or *italic* and _italic_.
Inline code uses single backticks: `foo()`, and code blocks use triple backticks:
```
bar()
```
Or can be indented by 4 spaces:
    foo()
And attaching an image is easy: ![Alt text](not-a-valid-image-url)
Here’s how that Markdown would look once you run the cell to render it:
(Note that the alt text for the image is displayed here because we didn’t actually use a valid image URL in
our example)
When attaching images, you have three options:
Use a URL to an image on the web.
Use a local URL to an image that you will be keeping alongside your notebook, such as in the same git
repo.
Add an attachment via “Edit > Insert Image”; this will convert the image into a string and store it inside
your notebook .ipynb file. Note that this will make your .ipynb file much larger!
There is plenty more to Markdown, especially around hyperlinking, and it’s also possible to simply include
plain HTML. Once you find yourself pushing the limits of the basics above, you can refer to the official
guide from Markdown’s creator, John Gruber, on his website.
Kernels
Behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel. Any
output is returned back to the cell to be displayed. The kernel’s state persists over time and between cells —
it pertains to the document as a whole and not individual cells.
For example, if you import libraries or declare variables in one cell, they will be available in another. Let’s try
this out to get a feel for it. First, we’ll import a Python package and define a function:
import numpy as np

def square(x):
    return x * x
Once we’ve executed the cell above, we can reference np and square in any other cell.
x = np.random.randint(1, 10)
y = square(x)
print('%d squared is %d' % (x, y))
1 squared is 1
This will work regardless of the order of the cells in your notebook. As long as a cell has been run, any
variables you declared or libraries you imported will be available in other cells.
You can try it yourself; let’s change one of our variables and print them again.
y = 10
print('Is %d squared %d?' % (x, y))
You may have noticed that Jupyter gives you the option to change kernel, and in fact there are many different
options to choose from. Back when you created a new notebook from the dashboard by selecting a Python
version, you were actually choosing which kernel to use.
There are kernels for different versions of Python, and also for over 100 languages including Java, C, and even
Fortran. Data scientists may be particularly interested in the kernels for R and Julia, as well as
both imatlab and the Calysto MATLAB Kernel for MATLAB.
The SoS kernel provides multi-language support within a single notebook.
Each kernel has its own installation instructions, but will likely require you to run some commands on your
computer.
Example Analysis
Now that we’ve looked at what a Jupyter Notebook is, it’s time to look at how they’re used in practice, which
should give us a clearer understanding of why they are so popular.
It’s finally time to get started with that Fortune 500 data set mentioned earlier. Remember, our goal is to find
out how the profits of the largest companies in the US changed historically.
It’s worth noting that everyone will develop their own preferences and style, but the general principles still
apply. You can follow along with this section in your own notebook if you wish, or use this as a guide to
creating your own approach.
Naming Your Notebooks
Before you start writing your project, you’ll probably want to give it a meaningful name. Click the file name
Untitled in the upper left of the screen to enter a new file name, and hit the Save icon (which looks like a
floppy disk) below it to save.
Note that closing the notebook tab in your browser will not “close” your notebook in the way closing a
document in a traditional application will. The notebook’s kernel will continue to run in the background and
needs to be shut down before it is truly “closed” — though this is pretty handy if you accidentally close your
tab or browser!
If the kernel is shut down, you can close the tab without worrying about whether it is still running or not.
The easiest way to do this is to select “File > Close and Halt” from the notebook menu. However, you can
also shut down the kernel either by going to “Kernel > Shutdown” from within the notebook app or by
selecting the notebook in the dashboard and clicking “Shutdown.”
Setup
It’s common to start off with a code cell specifically for imports and setup, so that if you choose to add or
change anything, you can simply edit and re-run the cell without causing any side-effects.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
We’ll import pandas to work with our data, Matplotlib to plot charts, and Seaborn to make our charts prettier.
It’s also common to import NumPy but in this case, pandas imports it for us.
That first line isn’t a Python command, but uses something called a line magic to instruct Jupyter to capture
Matplotlib plots and render them in the cell output. We’ll talk a bit more about line magics later, and they’re
also covered in our advanced Jupyter Notebooks tutorial.
For now, let’s go ahead and load our data.
df = pd.read_csv('fortune500.csv')
It’s sensible to also do this in a single cell, in case we need to reload it at any point.
Save and Checkpoint
Now we’ve got started, it’s best practice to save regularly. Pressing Ctrl + S will save our notebook by calling
the “Save and Checkpoint” command, but what is this checkpoint thing?
Every time we create a new notebook, a checkpoint file is created along with the notebook file. It is located
within a hidden subdirectory of your save location called .ipynb_checkpoints and is also a .ipynb file.
By default, Jupyter will autosave your notebook every 120 seconds to this checkpoint file without altering
your primary notebook file. When you “Save and Checkpoint,” both the notebook and checkpoint files are
updated. Hence, the checkpoint enables you to recover your unsaved work in the event of an unexpected
issue.
You can revert to the checkpoint from the menu via “File > Revert to Checkpoint.”
Investigating Our Data Set
Now we’re really rolling! Our notebook is safely saved and we’ve loaded our data set df into the most-used
pandas data structure, which is called a DataFrame and basically looks like a table. What does ours look like?
df.head()
df.tail()
(Output: the first and last five rows, with columns Year, Rank, Company, Revenue (in millions), and Profit (in millions).)
Looking good. We have the columns we need, and each row corresponds to a single company in a single
year.
Let’s just rename those columns so we can refer to them later.
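The renaming cell itself isn't reproduced above; judging by the lowercase names (year, rank, company, revenue, profit) used throughout the rest of the analysis, it would look something like this (the empty DataFrame here is just a stand-in for the real data loaded from fortune500.csv):

```python
import pandas as pd

# Stand-in for the loaded data; in the article, df comes from fortune500.csv
df = pd.DataFrame(columns=['Year', 'Rank', 'Company',
                           'Revenue (in millions)', 'Profit (in millions)'])

# Shorter, lowercase names are easier to refer to later, e.g. df.profit
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
```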
Next, we need to explore our data set. Is it complete? Did pandas read it as expected? Are any values
missing?
len(df)
25500
Okay, that looks good — that’s 500 rows for every year from 1955 to 2005, inclusive.
Let’s check whether our data set has been imported as we would expect. A simple check is to see if the data
types (or dtypes) have been correctly interpreted.
df.dtypes
year         int64
rank         int64
company     object
revenue    float64
profit      object
dtype: object
Uh oh. It looks like there’s something wrong with the profits column — we would expect it to be a float64 like
the revenue column. This indicates that it probably contains some non-numeric values, so let’s take a look.
non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()
Just as we suspected! Some of the values are strings, which have been used to indicate missing data. Are
there any other values that have crept in?
set(df.profit[non_numberic_profits])
{'N.A.'}
That makes it easy to interpret, but what should we do? Well, that depends how many values are missing.
len(df.profit[non_numberic_profits])
369
It’s a small fraction of our data set, though not completely inconsequential as it is still around 1.5%.
If rows containing N.A. are, roughly, uniformly distributed over the years, the easiest solution would just be to
remove them. So let’s have a quick look at the distribution.
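The plotting cell is missing above; a sketch of how that distribution could be drawn is below, using a small stand-in DataFrame since the real df comes from fortune500.csv:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# Stand-in rows; in the notebook, df is the full Fortune 500 DataFrame
df = pd.DataFrame({
    'year':   [1955, 1955, 1956, 1990, 1990, 1990],
    'profit': ['100.5', 'N.A.', '200.0', 'N.A.', 'N.A.', '50.1'],
})
non_numberic_profits = df.profit.str.contains('[^0-9.-]')

# One histogram bin per year counts the N.A. profit values in that year
bin_sizes, _, _ = plt.hist(df.year[non_numberic_profits],
                           bins=range(1955, 2006))
```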
At a glance, we can see that the most invalid values in a single year is fewer than 25, and as there are 500
data points per year, removing these values would account for less than 4% of the data for the worst years.
Indeed, other than a surge around the 90s, most years have fewer than half the missing values of the peak.
For our purposes, let’s say this is acceptable and go ahead and remove these rows.
df = df.loc[~non_numberic_profits]
df.profit = df.profit.apply(pd.to_numeric)
We should check that worked.
len(df)
25131
df.dtypes
year         int64
rank         int64
company     object
revenue    float64
profit     float64
dtype: object
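The plotting cells that follow use names (plot, x, y1, and later avgs and group_by_year) defined in a cell not reproduced above. A plausible reconstruction, consistent with how those names are used, groups the cleaned data by year and defines a small plotting helper (the tiny DataFrame here is only a stand-in for the real data set):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# Stand-in rows; in the notebook, df is the cleaned Fortune 500 DataFrame
df = pd.DataFrame({
    'year':    [1955, 1955, 1956, 1956],
    'revenue': [100.0, 200.0, 110.0, 210.0],
    'profit':  [10.0, 20.0, 11.0, 21.0],
})

# Mean revenue and profit per year, used for both charts
group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit

# Helper reused for every chart in this section
def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)
```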
fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2005', 'Profit (millions)')
Wow, that looks like an exponential, but it’s got some huge dips. They must correspond to the early 1990s
recession and the dot-com bubble. It’s pretty interesting to see that in the data. But how come profits
recovered to even higher levels post each recession?
Maybe the revenues can tell us more.
y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2005', 'Revenue (millions)')
That adds another side to the story. Revenues were not as badly hit — that’s some great accounting work
from the finance departments.
With a little help from Stack Overflow, we can superimpose these plots with +/- their standard deviations.
def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)
fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2005'
stds1 = group_by_year.std().profit.values
stds2 = group_by_year.std().revenue.values
plot_with_std(x, y1.values, stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)
fig.tight_layout()
That’s staggering; the standard deviations are huge! Some Fortune 500 companies make billions while
others lose billions, and the risk has increased along with rising profits over the years.
Perhaps some companies perform better than others; are the profits of the top 10% more or less volatile than
the bottom 10%?
There are plenty of questions that we could look into next, and it’s easy to see how the flow of working in a
notebook can match one’s own thought process. For the purposes of this tutorial, we’ll stop our analysis
here, but feel free to continue digging into the data on your own!
This flow helped us to easily investigate our data set in one place without context switching between
applications, and our work is immediately shareable and reproducible. If we wished to create a more concise
report for a particular audience, we could quickly refactor our work by merging cells and removing
intermediary code.
Installing Extensions
Once Nbextensions itself has been installed, there’s no need for additional installation of each extension.
Just start up a notebook and you should see an Nbextensions tab. Clicking this tab will show you a list of
available extensions. Simply tick the boxes for the extensions you want to enable, and you’re off to the races!
However, if you’ve already installed Nbextensions but aren’t seeing the tab, you’re not alone. This thread on
GitHub details some common issues and solutions.
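For reference, installing Nbextensions itself is typically a pip install followed by a one-time setup command (a sketch; check the project's README for your environment, as conda users install a different package):

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
```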
%matplotlib?
When you run the above cell in a notebook, a lengthy docstring will pop up onscreen with details about how
you can use the magic.
A Few Useful Magic Commands
We cover more in the advanced Jupyter tutorial, but here are a few to get you started:
Magic Command    What it does
%run             Runs an external script file as part of the cell being executed.
%timeit          Counts loops, measures and reports how long a code cell takes to execute.
There’s plenty more where that came from. Hop into Jupyter Notebooks and start exploring using %lsmagic!
Final Thoughts
Starting from scratch, we have come to grips with the natural workflow of Jupyter Notebooks, delved into
IPython’s more advanced features, and finally learned how to share our work with friends, colleagues, and
the world. And we accomplished all this from a notebook itself!
It should be clear how notebooks promote a productive working experience by reducing context switching
and emulating a natural development of thoughts during a project. The power of using Jupyter Notebooks
should also be evident, and we covered plenty of leads to get you started exploring more advanced
features in your own projects.
If you’d like further inspiration for your own Notebooks, Jupyter has put together a gallery of interesting
Jupyter Notebooks that you may find helpful and the Nbviewer homepage links to some really fancy
examples of quality notebooks.