How To Install Jupyter Notebook On Ubuntu: Getting Started


How to Install Jupyter Notebook on Ubuntu

Jupyter Notebook is one of the most widely used tools for executing Python interactively directly
from a browser. With Jupyter Notebooks, we have an excellent opportunity to mix code with
interactive exercises and documentation, which frees us from keeping our comments behind the #
symbol and also lets us see the output of small snippets of code directly in the browser.
With the IPython 4.0 release, the language-independent parts of the project (the notebook format,
message protocol, qtconsole, notebook web application, and so on) moved to a new project under
the name Jupyter.

Getting Started
We will start by installing the most basic components for this lesson, which are Python and PIP.
Let's start with basic updates on our machine:
sudo apt-get update
Here is what we get back with this command:

Update machine
Next, we can install the required components in a single command:
sudo apt-get -y install python2.7 python-pip python-dev
This installation might take some time depending on network speed, as many dependencies are
being installed here. We are using Python 2.7 with the PIP package manager, with which we can
install many other Python modules as we go. Finally, many of Jupyter's dependencies are built as
Python C extensions, so we installed the python-dev package as well.
To verify that everything went well, let us check the Python and PIP versions with these
commands:
python --version
pip --version
We will get back:

Python & PIP Version

Installing IPython & Jupyter


We can move on to installing the most important parts of this lesson. Let us run the following
commands to install IPython & Jupyter on our machine:
sudo apt-get -y install ipython
sudo -H pip install jupyter
Again, this command can take some time to complete depending on network speed. Once
the command finishes running, we can finally start the Jupyter notebook with:
jupyter notebook --allow-root
You can start Jupyter as a non-root user as well. If you start Jupyter as the root user, you will
have to use the --allow-root flag. Once this starts, the terminal will look like this:
Start Jupyter
Accessing Jupyter Console
We can access the Jupyter server from the URL shown in the console. If you're running this on a
remote server, you will have to tunnel to it through SSH. We can do this with the following
command:
ssh -L 8000:localhost:8888 your_server_username@your_server_ip
Once you execute this command, you can SSH into your server and start the Jupyter
notebook with this command:
jupyter notebook --allow-root
Accessing Jupyter Notebook
Once you’ve started the Jupyter notebook server through SSH, you will see a URL like this in
the terminal window:
http://localhost:8888/?token=8acc13e5fa48059e8ba0512bb54c616ed8ea100d5e25f639
Take note of the token in the URL. Now, open the following URL in your local machine's
browser:
http://localhost:8000/
You will see something like:
Accessing Jupyter
Once you're on this screen, enter the token collected in the previous step into the provided
field. Once we hit Enter, we will be inside and will see a screen like this:

Jupyter Notebook
We can now create a new notebook:
Creating new notebook
We can provide a name for this notebook by clicking on the title bar:
Naming a Notebook
Finally, you can write sample Python code and execute it in the browser itself.
Here is a short snippet that should generally work:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy

Installing Python Packages from a Jupyter Notebook

Tue 05 December 2017

In software, it's said that all abstractions are leaky, and this is as true for the Jupyter notebook as
it is for any other software. I most often see this manifest itself with the following issue:
I installed package X and now I can't import it in the notebook. Help!
This issue is a perennial source of StackOverflow questions (e.g. this, that, here, there, another,
this one, that one, and this... etc.).
Fundamentally the problem is usually rooted in the fact that the Jupyter kernels are
disconnected from Jupyter's shell; in other words, the installer points to a different Python
version than is being used in the notebook. In the simplest contexts this issue does not arise, but
when it does, debugging the problem requires knowledge of the intricacies of the operating
system, the intricacies of Python package installation, and the intricacies of Jupyter itself. In
other words, the Jupyter notebook, like all abstractions, is leaky.
In the wake of several discussions on this topic with colleagues, some online (exhibit A, exhibit
B) and some off, I decided to treat this issue in depth here. This post will address a couple things:
• First, I'll provide a quick, bare-bones answer to the general question: how can I install a
Python package so it works with my Jupyter notebook, using pip and/or conda?
• Second, I'll dive into some of the background of exactly what the Jupyter notebook
abstraction is doing, how it interacts with the complexities of the operating system, and
how you can think about where the "leaks" are, and thus better understand what's
happening when things stop working.
• Third, I'll talk about some ideas the community might consider to help smooth-over
these issues, including some changes that the Jupyter, Pip, and Conda developers might
consider to ease the cognitive load on users.
This post will focus on two approaches to installing Python packages: pip and conda. Other
package managers exist (including platform-specific tools like yum, apt, homebrew, etc., as well
as cross-platform tools like enstaller), but I'm less familiar with them and won't be remarking on
them further.

Quick Fix: How To Install Packages from the Jupyter Notebook
If you're just looking for a quick answer to the question, how do I install packages so they work
with the notebook, then look no further.
pip vs. conda
First, a few words on pip vs. conda. For many users, the choice between pip and conda can be a
confusing one. I wrote way more than you ever want to know about these in a post last year, but
the essential difference between the two is this:
• pip installs python packages in any environment.
• conda installs any package in conda environments.
If you already have a Python installation that you're using, then the choice of which to use is
easy:
• If you installed Python using Anaconda or Miniconda, then use conda to install Python
packages. If conda tells you the package you want doesn't exist, then use pip (or try
conda-forge, which has more packages available than the default conda channel).
• If you installed Python any other way (from source, using pyenv, virtualenv, etc.), then
use pip to install Python packages
Finally, because it often comes up, I should mention that you should never use sudo pip
install.
NEVER.
It will always lead to problems in the long term, even if it seems to solve them in the short-term.
For example, if pip install gives you a permission error, it likely means you're trying to
install/update packages in a system python, such as /usr/bin/python. Doing this can have bad
consequences, as often the operating system itself depends on particular versions of packages
within that Python installation. For day-to-day Python usage, you should isolate your packages
from the system Python, using either virtual environments or Anaconda/Miniconda — I
personally prefer conda for this, but I know many colleagues who prefer virtualenv.
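As a rough sketch of what that isolation can look like (the environment name and paths below are
only examples), you could create and use a dedicated environment like this:
$ python3 -m venv ~/envs/myproject          # or: conda create -n myproject python=3.6
$ source ~/envs/myproject/bin/activate      # or: source activate myproject
(myproject) $ python -m pip install jupyter numpy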
How to use Conda from the Jupyter Notebook
If you're in the jupyter notebook and you want to install a package with conda, you might be
tempted to use the ! notation to run conda directly as a shell command from the notebook:
# DON'T DO THIS!
!conda install --yes numpy
Fetching package metadata ...........
Solving package specifications: .

# All requested packages already installed.


# packages in environment at /Users/jakevdp/anaconda/envs/python3.6:
#
numpy 1.13.3 py36h2cdce51_0
(Note that we use --yes to automatically answer y if and when conda asks for user confirmation)
For various reasons that I'll outline more fully below, this will not generally work if you want to
use these installed packages from the current notebook, though it may work in the simplest cases.
Here is a short snippet that should work in general:
# Install a conda package in the current Jupyter kernel
import sys
!conda install --yes --prefix {sys.prefix} numpy
Fetching package metadata ...........
Solving package specifications: .

# All requested packages already installed.


# packages in environment at /Users/jakevdp/anaconda:
#
numpy 1.13.3 py36h2cdce51_0
That bit of extra boiler-plate makes certain that conda installs the package in the currently-
running Jupyter kernel (thanks to Min Ragan-Kelley for suggesting this approach). I'll discuss
why this is needed momentarily.
How to use Pip from the Jupyter Notebook
If you're using the Jupyter notebook and want to install a package with pip, you similarly might
be inclined to run pip directly in the shell:
# DON'T DO THIS
!pip install numpy
Requirement already satisfied: numpy in
/Users/jakevdp/anaconda/envs/python3.6/lib/python3.6/site-packages
For various reasons that I'll outline more fully below, this will not generally work if you want to
use these installed packages from the current notebook, though it may work in the simplest cases.
Here is a short snippet that should generally work:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy
Requirement already satisfied: numpy in
/Users/jakevdp/anaconda/lib/python3.6/site-packages
That bit of extra boiler-plate makes certain that you are running the pip version associated with
the current Python kernel, so that the installed packages can be used in the current notebook.
This is related to the fact that, even setting Jupyter notebooks aside, it's better to install packages
using
$ python -m pip install <package>
rather than
$ pip install <package>

because the former is more explicit about where the package will be installed (more on this
below).

The Details: Why is Installation from Jupyter so Messy?


The above solutions should work in all cases... but why is that additional boilerplate
necessary? In short, it's because in Jupyter, the shell environment and the Python executable
are disconnected. Understanding why that matters depends on a basic understanding of a few
different concepts:
• how your operating system locates executable programs,
• how Python installs and locates packages
• how Jupyter decides which Python executable to use.
For completeness, I'm going to delve briefly into each of these topics (this discussion is partly
drawn from This StackOverflow answer that I wrote last year).
Note: the following discussion assumes Linux, Unix, MacOSX and similar operating systems.
Windows has a slightly different architecture, and so some details will differ.
How your operating system locates executables
When you're using the terminal and type a command like python, jupyter, ipython, pip,
conda, etc., your operating system contains a well-defined mechanism to find the executable file
the name refers to.
On Linux & Mac systems, the system will first check for an alias matching the command; if this
fails it references the $PATH environment variable:
!echo $PATH
/Users/jakevdp/anaconda/envs/python3.6/bin:/Users/jakevdp/anaconda/envs/pytho
n3.6/bin:/Users/jakevdp/anaconda/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/
sbin
$PATH lists the directories, in order, that will be searched for any executable: for example, if I
type python on my system with the above $PATH, it will first look for
/Users/jakevdp/anaconda/envs/python3.6/bin/python, and if that doesn't exist it will look
for /Users/jakevdp/anaconda/bin/python, and so on.
(Parenthetical note: why is the first entry of $PATH repeated twice here? Because every time you
launch jupyter notebook, Jupyter prepends the location of the jupyter executable to the
beginning of the $PATH. In this case, the location was already at the beginning of the path, and
the result is that the entry is duplicated. Duplicate entries add clutter, but cause no harm).
If you want to know what is actually executed when you type python, you can use the type shell
command:
!type python
python is /Users/jakevdp/anaconda/envs/python3.6/bin/python
Note that this is true of any command you use from the terminal:
!type ls
ls is /bin/ls
Even built-in commands like type itself:
!type type
type is a shell builtin
You can optionally add the -a tag to see all available versions of the command in your current
shell environment; for example:
!type -a python
python is /Users/jakevdp/anaconda/envs/python3.6/bin/python
python is /Users/jakevdp/anaconda/envs/python3.6/bin/python
python is /Users/jakevdp/anaconda/bin/python
python is /usr/bin/python
!type -a conda
conda is /Users/jakevdp/anaconda/envs/python3.6/bin/conda
conda is /Users/jakevdp/anaconda/envs/python3.6/bin/conda
conda is /Users/jakevdp/anaconda/bin/conda
!type -a pip
pip is /Users/jakevdp/anaconda/envs/python3.6/bin/pip
pip is /Users/jakevdp/anaconda/envs/python3.6/bin/pip
pip is /Users/jakevdp/anaconda/bin/pip
When you have multiple available versions of any command, it is important to keep in mind the
role of $PATH in choosing which will be used.
How Python locates packages
Python uses a similar mechanism to locate imported packages. The list of paths searched by
Python on import is found in sys.path:
import sys
sys.path
['',
'/Users/jakevdp/anaconda/lib/python36.zip',
'/Users/jakevdp/anaconda/lib/python3.6',
'/Users/jakevdp/anaconda/lib/python3.6/lib-dynload',
'/Users/jakevdp/anaconda/lib/python3.6/site-packages',
'/Users/jakevdp/anaconda/lib/python3.6/site-packages/schemapi-
0.3.0.dev0+791c7f6-py3.6.egg',
'/Users/jakevdp/anaconda/lib/python3.6/site-packages/setuptools-27.2.0-
py3.6.egg',
'/Users/jakevdp/anaconda/lib/python3.6/site-packages/IPython/extensions',
'/Users/jakevdp/.ipython']
By default, the first place Python looks for a module is an empty path, meaning the current
working directory. If the module is not found there, it goes down the list of locations until the
module is found. You can find out which location has been used using the __path__ attribute of
an imported module:
import numpy
numpy.__path__
['/Users/jakevdp/anaconda/lib/python3.6/site-packages/numpy']
In most cases, a Python package you install with pip or with conda will be put in a directory
called site-packages. The important thing to realize is that each Python executable has its own
site-packages: what this means is that when you install a package, it is associated with
particular python executable and by default can only be used with that Python installation!
We can see this by printing the sys.path variables for each of the available python executables
in my path, using Jupyter's delightful ability to mix Python and bash commands in a single code
block:
paths = !type -a python
for path in set(paths):
    path = path.split()[-1]
    print(path)
    !{path} -c "import sys; print(sys.path)"
    print()
/Users/jakevdp/anaconda/envs/python3.6/bin/python
['', '/Users/jakevdp/anaconda/envs/python3.6/lib/python36.zip',
'/Users/jakevdp/anaconda/envs/python3.6/lib/python3.6',
'/Users/jakevdp/anaconda/envs/python3.6/lib/python3.6/lib-dynload',
'/Users/jakevdp/anaconda/envs/python3.6/lib/python3.6/site-packages']

/usr/bin/python
['',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-
darwin',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-
mac',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-
mac/lib-scriptpackages',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-
tk',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-
old',
'/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-
dynload', '/Library/Python/2.7/site-packages',
'/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python',
'/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/P
yObjC']

/Users/jakevdp/anaconda/bin/python
['', '/Users/jakevdp/anaconda/lib/python36.zip',
'/Users/jakevdp/anaconda/lib/python3.6',
'/Users/jakevdp/anaconda/lib/python3.6/lib-dynload',
'/Users/jakevdp/anaconda/lib/python3.6/site-packages',
'/Users/jakevdp/anaconda/lib/python3.6/site-packages/schemapi-
0.3.0.dev0+791c7f6-py3.6.egg', '/Users/jakevdp/anaconda/lib/python3.6/site-
packages/setuptools-27.2.0-py3.6.egg']

The full details here are not particularly important, but it is important to emphasize that each
Python executable has its own distinct paths, and unless you modify sys.path (which should
only be done with great care) you cannot import packages installed in a different Python
environment.
When you run pip install or conda install, these commands are associated with a
particular Python version:
• pip installs packages in the Python in its same path
• conda installs packages in the current active conda environment
So, for example we see that pip install will install to the conda environment named
python3.6:
!type pip
pip is /Users/jakevdp/anaconda/envs/python3.6/bin/pip
And conda install will do the same, because python3.6 is the current active environment
(notice the * indicating the active environment):
!conda env list
# conda environments:
#
python2.7 /Users/jakevdp/anaconda/envs/python2.7
python3.5 /Users/jakevdp/anaconda/envs/python3.5
python3.6 * /Users/jakevdp/anaconda/envs/python3.6
rstats /Users/jakevdp/anaconda/envs/rstats
root /Users/jakevdp/anaconda

The reason both pip and conda default to the conda python3.6 environment is that this is the
Python environment I used to launch the notebook.
I'll say this again for emphasis: the shell environment in Jupyter notebook matches the
Python version used to launch the notebook.
How Jupyter executes code: Jupyter Kernels
The next relevant question is how Jupyter chooses to execute Python code, and this brings us to
the concept of a Jupyter Kernel.
A Jupyter kernel is a set of files that point Jupyter to some means of executing code within the
notebook. For Python kernels, this will point to a particular Python version, but Jupyter is
designed to be much more general than this: Jupyter has dozens of available kernels for
languages including Python 2, Python 3, Julia, R, Ruby, Haskell, and even C++ and Fortran!
If you're using the Jupyter notebook, you can change your kernel at any time using the Kernel →
Choose Kernel menu item.
To see the kernels you have available on your system, you can run the following command in the
shell:
!jupyter kernelspec list
Available kernels:
python3 /Users/jakevdp/anaconda/envs/python3.6/lib/python3.6/site-
packages/ipykernel/resources
conda-root /Users/jakevdp/Library/Jupyter/kernels/conda-root
python2.7 /Users/jakevdp/Library/Jupyter/kernels/python2.7
python3.5 /Users/jakevdp/Library/Jupyter/kernels/python3.5
python3.6 /Users/jakevdp/Library/Jupyter/kernels/python3.6
Each of these listed kernels is a directory that contains a file called kernel.json which
specifies, among other things, which language and executable the kernel should use. For
example:
!cat /Users/jakevdp/Library/Jupyter/kernels/conda-root/kernel.json
{
  "argv": [
    "/Users/jakevdp/anaconda/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "python (conda-root)",
  "language": "python"
}
If you'd like to create a new kernel, you can do so using the jupyter ipykernel command; for
example, I created the above kernels for my primary conda environments using the following as
a template:
$ source activate myenv
$ python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

The Root of the Issue


Now we have the full background to answer our question: Why don't !pip install or !conda
install always work from the notebook?
The root of the issue is this: the shell environment is determined when the Jupyter notebook is
launched, while the Python executable is determined by the kernel, and the two do not
necessarily match. In other words, there is no guarantee that the python, pip, and conda in your
$PATH will be compatible with the python executable used by the notebook.
Recall that the python in your path can be determined using
!type python
python is /Users/jakevdp/anaconda/envs/python3.6/bin/python
The Python executable being used in the notebook can be determined using
sys.executable
'/Users/jakevdp/anaconda/bin/python'
In my current notebook environment, the two differ. This is why a simple !pip install or
!conda install does not work: the commands install packages in the site-packages of the
wrong Python installation.
As noted above, we can get around this by explicitly identifying where we want packages to be
installed.
For conda, you can set the prefix manually in the shell command:
$ conda install --yes --prefix /Users/jakevdp/anaconda numpy
or, to automatically use the correct prefix (using syntax available in the notebook)
!conda install --yes --prefix {sys.prefix} numpy
For pip, you can specify the Python executable explicitly:
$ /Users/jakevdp/anaconda/bin/python -m pip install numpy
or, to automatically use the correct executable (again using notebook shell syntax)
!{sys.executable} -m pip install numpy
Remember: you need your installation command to match the current python kernel if you want
installed packages to be available in the notebook.

Some Modest Proposals


So, in summary, the reason that installation of packages in the Jupyter notebook is fraught with
difficulty is fundamentally that Jupyter's shell environment and Python kernel are
mismatched, and that means that you have to do more than simply pip install or conda
install to make things work. The exception is the special case where you run jupyter
notebook from the same Python environment to which your kernel points; in that case the
simple installation approach should work.
But that leaves us in an undesirable place, as it increases the learning curve for novice users
who may want to do something they (rightly) presume should be simple: install a package and
then use it. So what can we as a community do to smooth out this issue?
I have a few ideas, some of which might even be useful:
Potential Changes to Jupyter
As I mentioned, the fundamental issue is a mismatch between Jupyter's shell environment and
compute kernel. So, could we massage kernel specifications such that they force the two to
match?
Perhaps: for example, this github issue shows an approach to modifying shell variables as part of
kernel startup.
Basically, in your kernel directory, you can add a script kernel-startup.sh that looks
something like this (and make sure you change the permissions so that it's executable):
#!/usr/bin/env bash

# activate anaconda env
source activate myenv

# this is the critical part, and should be at the end of your script:
exec python -m ipykernel $@
Then in your kernel.json file, modify the argv field to look like this:
"argv": [
"/path/to/kernel-startup.sh",
"-f",
"{connection_file}"
]
Once you do this, switching to the myenv kernel will automatically activate the myenv conda
environment, which changes your $CONDA_PREFIX, $PATH and other system variables such that
!conda install XXX and !pip install XXX will work correctly. A similar approach could
work for virtualenvs or other Python environments.
There is one tricky issue here: this approach will fail if your myenv environment does not have
the ipykernel package installed, and probably also requires it to have a jupyter version
compatible with that used to launch the notebook. So it's not a full solution to the problem by any
means, but if Python kernels could be designed to do this sort of shell initialization by default, it
would be far less confusing to users: !pip install and !conda install would simply work.
Potential Changes to pip
One source of installation confusion, even outside of Jupyter, is the fact that, depending on the
nature of your system's aliases and $PATH variable, pip and python might point to different
paths. In this case pip install will install packages to a path inaccessible to the python
executable. For this reason, it is safer to use python -m pip install, which explicitly specifies
the desired Python version (explicit is better than implicit, after all).
This is one reason that pip install no longer appears in Python's docs, and experienced Python
educators like David Beazley never teach bare pip. CPython developer Nick Coghlan has even
indicated that the pip executable may someday be deprecated in favor of python -m pip. Even
though it's more verbose, I think forcing users to be explicit would be a useful change,
particularly as the use of virtualenvs and conda envs becomes more common.
Changes to Conda
I can think of a couple of modifications to conda's API that may be helpful to users.
Explicit invocation
For symmetry with pip, it would be nice if python -m conda install could be expected to
work in the same way the pip counterpart does. You can call conda this way in the root
environment, but the conda Python package (as opposed to the conda executable) cannot
currently be installed anywhere but the root environment:
(myenv) jakevdp$ conda install conda
Fetching package metadata ...........

InstallError: Error: 'conda' can only be installed into the root environment
I suspect that allowing python -m conda install in all conda environments would require a
fairly significant redesign of conda's installation model, so it may not be worth the change just
for symmetry with pip's API. That said, such a symmetry would certainly be a help to users.
A pip channel for conda?
Another useful change conda could make would be to add a channel that essentially mirrors the
Python Package Index, so that when you do conda install some-package it will
automatically draw from packages available to pip as well.
I don't have a deep enough knowledge of conda's architecture to know how easy such a feature
would be to implement, but I do have loads of experiences helping newcomers to Python and/or
conda: I can say with certainty that such a feature would go a long way toward softening their
learning curve.
New Jupyter Magic Functions
Even if the above changes to the stack are not possible or desirable, we could simplify the user
experience somewhat by introducing %pip and %conda magic functions within the Jupyter
notebook that detect the current kernel and make certain packages are installed in the correct
location.
pip magic
For example, here's how you can define a %pip magic function that works in the current kernel:
from IPython.core.magic import register_line_magic

@register_line_magic
def pip(args):
    """Use pip from the current kernel"""
    from pip import main
    main(args.split())
Running it as follows will install packages in the expected location
%pip install numpy
Requirement already satisfied: numpy in
/Users/jakevdp/anaconda/lib/python3.6/site-packages
Note that Jupyter developer Matthias Bussonnier has published essentially this in his pip_magic
repository, so you can do
$ python -m pip install pip_magic
and use this right now (that is, assuming you install pip_magic in the right place!)
conda magic
Similarly, we can define a conda magic that will do the right thing if you type %conda install
XXX. This is a bit more involved than the pip magic, because it must first confirm that the
environment is conda-compatible, and then (related to the lack of python -m conda install)
must call a subprocess to execute the appropriate shell command:
from IPython.core.magic import register_line_magic
import sys
import os
from subprocess import Popen, PIPE

def is_conda_environment():
    """Return True if the current Python executable is in a conda env"""
    # TODO: make this work with Conda.exe in Windows
    conda_exec = os.path.join(os.path.dirname(sys.executable), 'conda')
    conda_history = os.path.join(sys.prefix, 'conda-meta', 'history')
    return os.path.exists(conda_exec) and os.path.exists(conda_history)

@register_line_magic
def conda(args):
    """Use conda from the current kernel"""
    # TODO: make this work with Conda.exe in Windows
    # TODO: fix string encoding to work with Python 2
    if not is_conda_environment():
        raise ValueError("The python kernel does not appear to be a conda environment. "
                         "Please use ``%pip install`` instead.")

    conda_executable = os.path.join(os.path.dirname(sys.executable), 'conda')
    args = [conda_executable] + args.split()

    # Add --prefix to point conda installation to the current environment
    if args[1] in ['install', 'update', 'upgrade', 'remove', 'uninstall', 'list']:
        if '-p' not in args and '--prefix' not in args:
            args.insert(2, '--prefix')
            args.insert(3, sys.prefix)

    # Because the notebook does not allow us to respond "yes" during the
    # installation, we need to insert --yes in the argument list for some commands
    if args[1] in ['install', 'update', 'upgrade', 'remove', 'uninstall', 'create']:
        if '-y' not in args and '--yes' not in args:
            args.insert(2, '--yes')

    # Call conda from command line with subprocess & send results to stdout & stderr
    with Popen(args, stdout=PIPE, stderr=PIPE) as process:
        # Read stdout character by character, as it includes real-time progress updates
        for c in iter(lambda: process.stdout.read(1), b''):
            sys.stdout.write(c.decode(sys.stdout.encoding))
        # Read stderr line by line, because real-time does not matter
        for line in iter(process.stderr.readline, b''):
            sys.stderr.write(line.decode(sys.stderr.encoding))
You can now use %conda install and it will install packages to the correct environment:
%conda install numpy
Fetching package metadata ...........
Solving package specifications: .

# All requested packages already installed.


# packages in environment at /Users/jakevdp/anaconda:
#
numpy 1.13.3 py36h2cdce51_0
This conda magic still needs some work to be a general solution (cf. the TODO comments in the
code), but I think this is a useful start.
If a pip magic and conda magic similar to the above were added to Jupyter's default set of magic
commands, I think it could go a long way toward solving the common problems that users have
when trying to install Python packages for use with Jupyter notebooks. This approach is not
without its own dangers, though: these magics are yet another layer of abstraction that, like all
abstractions, will inevitably leak. But if they are implemented carefully, I think it would lead to a
much nicer overall user experience.

Python Modules
In this article, you will learn to create and import custom modules in Python. Also, you will find
different techniques to import and use custom and built-in modules in Python.
Table of Contents
• What are modules in Python?
• How to import modules in Python?
• Python import statement
• Import with renaming
• Python from...import statement
• Import all names
• Python Module Search Path
• Reloading a module
• The dir() built-in function

What are modules in Python?


A module is a file containing Python statements and definitions.
A file containing Python code, for example example.py, is called a module, and its module name
would be example.
We use modules to break down large programs into small manageable and organized files.
Furthermore, modules provide reusability of code.
We can define our most used functions in a module and import it, instead of copying their
definitions into different programs.
Let us create a module. Type the following and save it as example.py.
# Python Module example

def add(a, b):
    """This function adds two
    numbers and returns the result"""

    result = a + b
    return result
Here, we have defined a function add() inside a module named example. The function takes in
two numbers and returns their sum.

How to import modules in Python?


We can import the definitions inside a module to another module or the interactive interpreter in
Python.
We use the import keyword to do this. To import our previously defined module example we
type the following in the Python prompt.
>>> import example
This does not enter the names of the functions defined in example directly in the current symbol
table. It only enters the module name example there.
Using the module name we can access the function using dot (.) operation. For example:
>>> example.add(4,5.5)
9.5
Python has a ton of standard modules available.
You can check out the full list of Python standard modules and what they are for. These files are
in the Lib directory inside the location where you installed Python.
Standard modules can be imported the same way as we import our user-defined modules.
There are various ways to import modules. They are listed as follows.
Python import statement
We can import a module using the import statement and access the definitions inside it using the
dot operator as described above. Here is an example.
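A minimal script of this form (using the standard math module) could be:

import math

print("The value of pi is", math.pi)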

When you run the program, the output will be:


The value of pi is 3.141592653589793

Import with renaming


We can import a module by renaming it as follows.
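For instance, a script along these lines:

import math as m

print("The value of pi is", m.pi)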

We have renamed the math module as m. This can save us typing time in some cases.
Note that the name math is not recognized in our scope. Hence, math.pi is invalid; m.pi is the
correct way to access it.
Python from...import statement
We can import specific names from a module without importing the module as a whole. Here is
an example.
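A minimal version could be:

from math import pi

print("The value of pi is", pi)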

We imported only the attribute pi from the module.


In such a case, we don't use the dot operator. We could have imported multiple attributes as
follows.
>>> from math import pi, e
>>> pi
3.141592653589793
>>> e
2.718281828459045

Import all names


We can import all names(definitions) from a module using the following construct.
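For example:

from math import *

print("The value of pi is", pi)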

We imported all the definitions from the math module. This makes all names, except those
beginning with an underscore, visible in our scope.
Importing everything with the asterisk (*) symbol is not a good programming practice. This can
lead to duplicate definitions for an identifier. It also hampers the readability of our code.

Python Module Search Path


While importing a module, Python looks in several places. The interpreter first looks for a built-in
module and then (if not found) searches a list of directories defined in sys.path. The search is in
this order:
• The current directory.
• PYTHONPATH (an environment variable with a list of directories).
• The installation-dependent default directory.
>>> import sys
>>> sys.path
['',
'C:\\Python33\\Lib\\idlelib',
'C:\\Windows\\system32\\python33.zip',
'C:\\Python33\\DLLs',
'C:\\Python33\\lib',
'C:\\Python33',
'C:\\Python33\\lib\\site-packages']
We can also modify this list to add our own paths.
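For example, appending a directory of our own (the path below is just an illustration) makes the
modules stored there importable:

import sys

# hypothetical directory containing our own modules
sys.path.append('C:\\my_modules')
print(sys.path[-1])    # 'C:\\my_modules' is now searched on import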

Reloading a module
The Python interpreter imports a module only once during a session. This makes things more
efficient. Here is an example to show how this works.
Suppose we have the following code in a module named my_module.
# This module shows the effect of
# multiple imports and reload
print("This code got executed")
Now we see the effect of multiple imports.
>>> import my_module
This code got executed
>>> import my_module
>>> import my_module
We can see that our code got executed only once. This goes to say that our module was imported
only once.
Now if our module changed during the course of the program, we would have to reload it. One
way to do this is to restart the interpreter, but this does not help much.
Python provides a neater way of doing this. We can use the reload() function inside the imp
module (importlib.reload() in Python 3.4 and later) to reload a module. This is how it's done.
>>> import imp
>>> import my_module
This code got executed
>>> import my_module
>>> imp.reload(my_module)
This code got executed
<module 'my_module' from '.\\my_module.py'>

The dir() built-in function


We can use the dir() function to find out names that are defined inside a module.
For example, we have defined a function add() in the module example that we had in the
beginning.
>>> dir(example)
['__builtins__',
'__cached__',
'__doc__',
'__file__',
'__initializing__',
'__loader__',
'__name__',
'__package__',
'add']
Here, we can see a sorted list of names (along with add). All other names that begin with an
underscore are default Python attributes associated with the module (we did not define them
ourselves).
For example, the __name__ attribute contains the name of the module.
>>> import example
>>> example.__name__
'example'
All the names defined in our current namespace can be found out using the dir() function
without any arguments.
>>> a = 1
>>> b = "hello"
>>> import math
>>> dir()
['__builtins__', '__doc__', '__name__', 'a', 'b', 'math', 'pyscripter']

Python Package
In this article, you'll learn to divide your code base into clean, efficient modules using Python
packages. Also, you'll learn to import and use your own or third-party packages in your Python
program.
Table of Contents
• What are packages?
• Importing module from a package

What are packages?


We don't usually store all of our files on our computer in the same location. We use a well-
organized hierarchy of directories for easier access.
Similar files are kept in the same directory, for example, we may keep all the songs in the
"music" directory. Analogous to this, Python has packages for directories and modules for files.
As our application program grows larger in size with a lot of modules, we place similar modules
in one package and different modules in different packages. This makes a project (program) easy
to manage and conceptually clear.
Similarly, as a directory can contain sub-directories and files, a Python package can have sub-
packages and modules.
A directory must contain a file named __init__.py in order for Python to consider it as a
package. This file can be left empty but we generally place the initialization code for that
package in this file.
Here is an example. Suppose we are developing a game; one possible organization of packages
and modules could be as shown below.
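One layout consistent with the imports used below (the sub-packages and module names are
illustrative) is:

Game/
    __init__.py
    Sound/
        __init__.py
        load.py
        play.py
    Image/
        __init__.py
        open.py
        draw.py
    Level/
        __init__.py
        start.py
        load.py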
Importing module from a package
We can import modules from packages using the dot (.) operator.
For example, if we want to import the start module in the above example, it is done as follows.
import Game.Level.start
Now if this module contains a function named select_difficulty(), we must use the full
name to reference it.
Game.Level.start.select_difficulty(2)
If this construct seems lengthy, we can import the module without the package prefix as follows.
from Game.Level import start
We can now call the function simply as follows.
start.select_difficulty(2)
Yet another way of importing just the required function (or class or variable) from a module
within a package would be as follows.
from Game.Level.start import select_difficulty
Now we can directly call this function.
select_difficulty(2)
Although easier, this method is not recommended. Using the full namespace avoids confusion
and prevents two identifiers with the same name from colliding.
While importing packages, Python looks in the list of directories defined in sys.path, similar to
the module search path.
Understand Python Namespace and Scope
with Examples
Python Tutorials | By Meenakshi Agarwal

In this class, we'll cover what a Python namespace is and why it is needed. We'll also talk about
what scope is in Python and how namespaces can be used to implement it.
The concept of namespaces is not limited to any particular programming language. C/C++ and
Java also have it where it works as a means to distinguish between different sections of a
program.
The body of a section may consist of a method, or a function, or all the methods of a class. So, a
namespace is a practical approach to define the scope, and it helps to avoid name conflicts.
In Python, the namespace is a fundamental idea for structuring and organizing code, and it is
especially useful in large projects. However, it can be a difficult concept to grasp if you're new to
programming, so we have tried to make namespaces just a little easier to understand.

Python Namespace and Scope


Python Namespace, Scope, and Scope Resolution
What are names in Python?
Before getting on to namespaces, first, let’s understand what Python means by a name.
A name in Python is just a way to access a variable, as in any other language. However, Python
is more flexible when it comes to variable declaration: you can declare a variable by just
assigning a name to it.
You can use names to reference values.
num = 5
str = 'Z'
seq = [0, 1, 1, 2, 3, 5]
You can even assign a name to a function.
def function():
    print('It is a function.')

foo = function
foo()
You can also assign a name and then reuse it. Check the below example; it is alright for a name
to point to different values.
test = -1
print("type <test> :=", type(test))
test = "Pointing to a string now"
print("type <test> :=", type(test))
test = [0, 1, 1, 2, 3, 5, 8]
print("type <test> :=", type(test))
And here is the output follows.
type <test> := <class 'int'>
type <test> := <class 'str'>
type <test> := <class 'list'>
So, you can see that one name is working perfectly fine to hold data of different types.
You can learn more about types in Python from here – Python data types.
The naming mechanism works inline with Python’s object system, i.e., everything in Python is
an object. All the data types such as numbers, strings, functions, classes are all objects. And a
name acts as a reference to get to the objects.
What are namespaces in Python?
A namespace is a simple system to control the names in a program. It ensures that names are
unique and won’t lead to any conflict.
Also, note that Python implements namespaces in the form of dictionaries. It maintains a
name-to-object mapping where names act as keys and objects as values. Multiple namespaces
may contain the same name, each pointing to a different object. Check out a few examples of
namespaces for more clarity.
Local Namespace
This namespace covers the local names inside a function. Python creates this namespace for
every function called in a program. It remains active until the function returns.
Global Namespace
This namespace covers the names from various imported modules used in a project. Python
creates this namespace for every module included in your program. It’ll last until the program
ends.
Built-in Namespace
This namespace covers the built-in functions and built-in exception names. Python creates it as
the interpreter starts and keeps it until you exit.
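As a small illustration of the dictionary idea, the built-in globals() and locals() functions expose
the module-level and function-level mappings, respectively:

num = 5

def show_namespaces():
    local_name = 'inside'
    # locals() is the dictionary backing the function's local namespace
    print(locals())            # {'local_name': 'inside'}
    # the module-level (global) namespace holds 'num' and 'show_namespaces'
    print('num' in globals())  # True

show_namespaces()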
What is Scope in Python?
Namespaces make our programs immune from name conflicts. However, it doesn’t give us a free
ride to use a variable name anywhere we want. Python restricts names to be bound by specific
rules known as a scope. The scope determines the parts of the program where you could use that
name without any prefix.
Python outlines different scopes for locals, enclosing functions, modules, and built-ins. Check out
the list below.
• A local scope, also known as the innermost scope, holds the list of all local names available in
the current function.
• A scope for all the enclosing functions, it finds a name from the nearest enclosing scope and
goes outwards.
• A module level scope, it takes care of all the global names from the current module.
• The outermost scope which manages the list of all the built-in names. It is the last place to
search for a name that you cited in the program.
Scope Resolution in Python – Examples
Scope resolution for a given name begins from the inner-most function and then goes higher and
higher until the program finds the related object. If the search ends without any outcome, then
the program throws a NameError exception.
Let’s now see some examples which you can run inside any Python IDE or with IDLE.
a_var = 10
print("begin()-> ", dir())

def foo():
    b_var = 11
    print("inside foo()-> ", dir())

foo()

print("end()-> ", dir())


The output is as follows.
begin()-> ['__builtins__', '__doc__', '__file__', '__loader__', '__name__',
'__package__', '__spec__', 'a_var']
inside foo()-> ['b_var']
end()-> ['__builtins__', '__doc__', '__file__', '__loader__', '__name__',
'__package__', '__spec__', 'a_var', 'foo']
In this example, we used the dir() function. It lists all the names that are available in a Python
program at that point.
In the first print() statement, the dir() only displays the list of names inside the current scope.
While in the second print(), it finds only one name, “b_var,” a local function variable.
Calling dir() after defining the foo() pushes it to the list of names available in the global
namespace.
In the next example, we’ll see the list of names inside some nested functions. The code in this
block continues from the previous block.
def outer_foo():
    outer_var = 3
    def inner_foo():
        inner_var = 5
        print(dir(), ' - names in inner_foo')
    outer_var = 7
    inner_foo()
    print(dir(), ' - names in outer_foo')

outer_foo()
The output is as follows.
['inner_var'] - names in inner_foo
['inner_foo', 'outer_var'] - names in outer_foo
The above example defines two variables and a function inside the scope of outer_foo(). Inside
the inner_foo(), the dir() function only displays one name i.e. “inner_var”. It is alright as the
“inner_var” is the only variable defined in there.
If you reuse a global name inside a local namespace, then Python creates a new local variable
with the same name.
a_var = 5
b_var = 7

def outer_foo():
    global a_var
    a_var = 3
    b_var = 9
    def inner_foo():
        global a_var
        a_var = 4
        b_var = 8
        print('a_var inside inner_foo :', a_var)
        print('b_var inside inner_foo :', b_var)
    inner_foo()
    print('a_var inside outer_foo :', a_var)
    print('b_var inside outer_foo :', b_var)

outer_foo()
print('a_var outside all functions :', a_var)
print('b_var outside all functions :', b_var)
Here goes the output of the above code after execution.
a_var inside inner_foo : 4
b_var inside inner_foo : 8
a_var inside outer_foo : 4
b_var inside outer_foo : 9
a_var outside all functions : 4
b_var outside all functions : 7
We've declared "a_var" as a global variable inside both the outer_foo() and inner_foo()
functions, and we've assigned different values to the same global variable. Because inner_foo()
assigns to it last, the value of "a_var" is the same (i.e., 4) everywhere afterwards.
Whereas, each function is creating its own “b_var” variable inside the local scope. And the
print() function is showing the values of this variable as per its local context.
How to correctly import modules in Python?
It is very likely that you would import some of the external modules in your program. So, we’ll
discuss here some of the import strategies, and you can choose the best one.
Import all names from a module
from <module name> import *
It'll import all the names from a module directly into your working namespace. Since it is an
effortless way, you might be tempted to use this method. However, you may not be able to tell
which module imported a particular function.
Here is an example of using this method.
print("namespace_1: ", dir())

from math import *


print("namespace_2: ", dir())
print(sqrt(144.2))

from cmath import *


print("namespace_3: ", dir())
print(sqrt(144.2))
The output of the above code is as follows.
namespace_1: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__']
namespace_2: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh',
'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e',
'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp',
'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf',
'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan',
'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan', 'tanh', 'trunc']
12.00833044182246
namespace_3: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__', 'acos', 'acosh', 'asin', 'asinh',
'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos', 'cosh', 'degrees', 'e',
'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor', 'fmod', 'frexp',
'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose', 'isfinite', 'isinf',
'isnan', 'ldexp', 'lgamma', 'log', 'log10', 'log1p', 'log2', 'modf', 'nan',
'phase', 'pi', 'polar', 'pow', 'radians', 'rect', 'sin', 'sinh', 'sqrt',
'tan', 'tanh', 'trunc']
(12.00833044182246+0j)
In this example, we’ve imported two distinct math modules, one after the other. There are some
common names which both of these modules have. So, the second module will override the
definitions of functions in the first.
The first call to sqrt() returns a real number and the second one gives a complex number. And
now, there is no way we can call the sqrt() function from the first math module.
Even if we try to call the function using the module name, Python will raise the NameError
exception, because the name math itself was never imported. So, the lesson learned here is that
there are no shortcuts for quality code.
Import specific names from a module
from <module name> import <foo_1>, <foo_2>
If you are sure of the names to be used from a module, then import them directly into your
program. This method is slightly better, but it will not completely prevent you from polluting the
namespace, although it does restrict you to the names you explicitly imported. Here also, any
function having the same name in your program will override the definition imported from the
module; the affected function becomes unreachable in such a case.
Check out an example of using this method.
print("namespace_1: ", dir())

from math import sqrt, pow


print("namespace_2: ", dir())
print(sqrt(144.2))
The output of the above code is as follows.
namespace_1: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__']
namespace_2: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__', 'pow', 'sqrt']
12.00833044182246

Import just the module using its name


import <module name>
It is the most reliable and suggested way of importing a module. However, it comes with a catch
that you need to prefix the name of the module before using any name from it. But you can
prevent the program from polluting the namespace and freely define functions with matching
names in the module.
print("namespace_1: ", dir())

import math
print("namespace_2: ", dir())
print(math.sqrt(144.2))
The output of the above example goes like this.
namespace_1: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__']
namespace_2: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__', 'math']
12.00833044182246

Math in python

9.2. math — Mathematical functions


This module is always available. It provides access to the mathematical functions defined by the
C standard.
These functions cannot be used with complex numbers; use the functions of the same name from
the cmath module if you require support for complex numbers. The distinction between
functions which support complex numbers and those which don’t is made since most users do
not want to learn quite as much mathematics as required to understand complex numbers.
Receiving an exception instead of a complex result allows earlier detection of the unexpected
complex number used as a parameter, so that the programmer can determine how and why it was
generated in the first place.
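For instance, the two modules treat a negative square root differently:
>>> import math, cmath
>>> math.sqrt(-1)
Traceback (most recent call last):
  ...
ValueError: math domain error
>>> cmath.sqrt(-1)
1j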
The following functions are provided by this module. Except when explicitly noted otherwise, all
return values are floats.

9.2.1. Number-theoretic and representation functions


math.ceil(x)

Return the ceiling of x as a float, the smallest integer value greater than or equal to x.
math.copysign(x, y)

Return x with the sign of y. On a platform that supports signed zeros,
copysign(1.0, -0.0) returns -1.0.
New in version 2.6.
math.fabs(x)

Return the absolute value of x.


math.factorial(x)

Return x factorial. Raises ValueError if x is not integral or is negative.


New in version 2.6.
math.floor(x)

Return the floor of x as a float, the largest integer value less than or equal to x.
math.fmod(x, y)

Return fmod(x, y), as defined by the platform C library. Note that the Python
expression x % y may not return the same result. The intent of the C standard is that
fmod(x, y) be exactly (mathematically; to infinite precision) equal to x - n*y for some
integer n such that the result has the same sign as x and magnitude less than abs(y).
Python’s x % y returns a result with the sign of y instead, and may not be exactly
computable for float arguments. For example, fmod(-1e-100, 1e100) is -1e-100, but
the result of Python’s -1e-100 % 1e100 is 1e100-1e-100, which cannot be represented
exactly as a float, and rounds to the surprising 1e100. For this reason, function fmod() is
generally preferred when working with floats, while Python’s x % y is preferred when
working with integers.
math.frexp(x)

Return the mantissa and exponent of x as the pair (m, e). m is a float and e is an integer
such that x == m * 2**e exactly. If x is zero, returns (0.0, 0), otherwise 0.5 <=
abs(m) < 1. This is used to “pick apart” the internal representation of a float in a
portable way.
math.fsum(iterable)

Return an accurate floating point sum of values in the iterable. Avoids loss of precision
by tracking multiple intermediate partial sums:
>>> sum([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1])
0.9999999999999999
>>> fsum([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1])
1.0
The algorithm’s accuracy depends on IEEE-754 arithmetic guarantees and the typical
case where the rounding mode is half-even. On some non-Windows builds, the
underlying C library uses extended precision addition and may occasionally double-round
an intermediate sum causing it to be off in its least significant bit.
For further discussion and two alternative approaches, see the ASPN cookbook recipes
for accurate floating point summation.
New in version 2.6.
math.isinf(x)

Check if the float x is positive or negative infinity.


New in version 2.6.
math.isnan(x)

Check if the float x is a NaN (not a number). For more information on NaNs, see the
IEEE 754 standards.
New in version 2.6.
math.ldexp(x, i)

Return x * (2**i). This is essentially the inverse of function frexp().
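For example, 8.0 is 0.5 * 2**4, so frexp() and ldexp() round-trip it:
>>> import math
>>> math.frexp(8.0)
(0.5, 4)
>>> math.ldexp(0.5, 4)
8.0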


math.modf(x)

Return the fractional and integer parts of x. Both results carry the sign of x and are floats.
math.trunc(x)

Return the Real value x truncated to an Integral (usually a long integer). Uses the
__trunc__ method.
New in version 2.6.
Note that frexp() and modf() have a different call/return pattern than their C equivalents: they
take a single argument and return a pair of values, rather than returning their second return value
through an ‘output parameter’ (there is no such thing in Python).
For the ceil(), floor(), and modf() functions, note that all floating-point numbers of
sufficiently large magnitude are exact integers. Python floats typically carry no more than 53 bits
of precision (the same as the platform C double type), in which case any float x with abs(x) >=
2**52 necessarily has no fractional bits.

9.2.2. Power and logarithmic functions


math.exp(x)

Return e**x.
math.expm1(x)
Return e**x - 1. For small floats x, the subtraction in exp(x) - 1 can result in a
significant loss of precision; the expm1() function provides a way to compute this
quantity to full precision:
>>> from math import exp, expm1
>>> exp(1e-5) - 1 # gives result accurate to 11 places
1.0000050000069649e-05
>>> expm1(1e-5) # result accurate to full precision
1.0000050000166668e-05
New in version 2.7.
math.log(x[, base])

With one argument, return the natural logarithm of x (to base e).
With two arguments, return the logarithm of x to the given base, calculated as
log(x)/log(base).
Changed in version 2.3: base argument added.
math.log1p(x)

Return the natural logarithm of 1+x (base e). The result is calculated in a way which is
accurate for x near zero.
New in version 2.6.
math.log10(x)

Return the base-10 logarithm of x. This is usually more accurate than log(x, 10).
math.pow(x, y)

Return x raised to the power y. Exceptional cases follow Annex ‘F’ of the C99 standard
as far as possible. In particular, pow(1.0, x) and pow(x, 0.0) always return 1.0, even
when x is a zero or a NaN. If both x and y are finite, x is negative, and y is not an integer
then pow(x, y) is undefined, and raises ValueError.
Unlike the built-in ** operator, math.pow() converts both its arguments to type float.
Use ** or the built-in pow() function for computing exact integer powers.
Changed in version 2.6: The outcome of 1**nan and nan**0 was undefined.
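The difference is easy to see in a short comparison (an added illustrative snippet):
>>> import math
>>> math.pow(2, 10)    # arguments are converted to float, result is a float
1024.0
>>> 2 ** 10            # the built-in operator keeps exact integer arithmetic
1024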
math.sqrt(x)

Return the square root of x.

9.2.3. Trigonometric functions


math.acos(x)

Return the arc cosine of x, in radians.


math.asin(x)

Return the arc sine of x, in radians.


math.atan(x)

Return the arc tangent of x, in radians.


math.atan2(y, x)

Return atan(y / x), in radians. The result is between -pi and pi. The vector in the
plane from the origin to point (x, y) makes this angle with the positive X axis. The
point of atan2() is that the signs of both inputs are known to it, so it can compute the
correct quadrant for the angle. For example, atan(1) and atan2(1, 1) are both pi/4,
but atan2(-1, -1) is -3*pi/4.
math.cos(x)

Return the cosine of x radians.


math.hypot(x, y)

Return the Euclidean norm, sqrt(x*x + y*y). This is the length of the vector from the
origin to point (x, y).
math.sin(x)

Return the sine of x radians.


math.tan(x)

Return the tangent of x radians.

9.2.4. Angular conversion


math.degrees(x)

Convert angle x from radians to degrees.


math.radians(x)

Convert angle x from degrees to radians.

9.2.5. Hyperbolic functions


math.acosh(x)

Return the inverse hyperbolic cosine of x.


New in version 2.6.
math.asinh(x)

Return the inverse hyperbolic sine of x.


New in version 2.6.
math.atanh(x)

Return the inverse hyperbolic tangent of x.


New in version 2.6.
math.cosh(x)

Return the hyperbolic cosine of x.


math.sinh(x)

Return the hyperbolic sine of x.


math.tanh(x)

Return the hyperbolic tangent of x.

9.2.6. Special functions


math.erf(x)

Return the error function at x.


New in version 2.7.
math.erfc(x)

Return the complementary error function at x.


New in version 2.7.
math.gamma(x)

Return the Gamma function at x.


New in version 2.7.
math.lgamma(x)

Return the natural logarithm of the absolute value of the Gamma function at x.
New in version 2.7.

9.2.7. Constants
math.pi

The mathematical constant π = 3.141592…, to available precision.


math.e

The mathematical constant e = 2.718281…, to available precision.

The Basics¶
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements
(usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy
dimensions are called axes. The number of axes is rank.
For example, the coordinates of a point in 3D space, [1, 2, 1], form an array of rank 1, because it has
one axis. That axis has a length of 3. In the example pictured below, the array has rank 2 (it is 2-
dimensional). The first dimension (axis) has a length of 2, the second dimension has a length of
3.
[[ 1., 0., 0.],
[ 0., 1., 2.]]
NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array
is not the same as the Standard Python Library class array.array, which only handles one-
dimensional arrays and offers less functionality. The more important attributes of an ndarray
object are:
ndarray.ndim
the number of axes (dimensions) of the array. In the Python world, the number of
dimensions is referred to as rank.
ndarray.shape
the dimensions of the array. This is a tuple of integers indicating the size of the array in
each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length
of the shape tuple is therefore the rank, or number of dimensions, ndim.
ndarray.size
the total number of elements of the array. This is equal to the product of the elements of
shape.
ndarray.dtype
an object describing the type of the elements in the array. One can create or specify
dtype’s using standard Python types. Additionally NumPy provides types of its own.
numpy.int32, numpy.int16, and numpy.float64 are some examples.
ndarray.itemsize
the size in bytes of each element of the array. For example, an array of elements of type
float64 has itemsize 8 (=64/8), while one of type complex32 has itemsize 4 (=32/8). It is
equivalent to ndarray.dtype.itemsize.
ndarray.data
the buffer containing the actual elements of the array. Normally, we won’t need to use
this attribute because we will access the elements in an array using indexing facilities.
An example
>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
>>> a.dtype.name
'int64'
>>> a.itemsize
8
>>> a.size
15
>>> type(a)
<type 'numpy.ndarray'>
>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])
>>> type(b)
<type 'numpy.ndarray'>

Array Creation
There are several ways to create arrays.
For example, you can create an array from a regular Python list or tuple using the array function.
The type of the resulting array is deduced from the type of the elements in the sequences.
>>> import numpy as np
>>> a = np.array([2,3,4])
>>> a
array([2, 3, 4])
>>> a.dtype
dtype('int64')
>>> b = np.array([1.2, 3.5, 5.1])
>>> b.dtype
dtype('float64')
A frequent error consists in calling array with multiple numeric arguments, rather than providing
a single list of numbers as an argument.
>>> a = np.array(1,2,3,4) # WRONG
>>> a = np.array([1,2,3,4]) # RIGHT
array transforms sequences of sequences into two-dimensional arrays, sequences of sequences of
sequences into three-dimensional arrays, and so on.
>>> b = np.array([(1.5,2,3), (4,5,6)])
>>> b
array([[ 1.5, 2. , 3. ],
[ 4. , 5. , 6. ]])
The type of the array can also be explicitly specified at creation time:
>>> c = np.array( [ [1,2], [3,4] ], dtype=complex )
>>> c
array([[ 1.+0.j, 2.+0.j],
[ 3.+0.j, 4.+0.j]])
Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy
offers several functions to create arrays with initial placeholder content. These minimize the
necessity of growing arrays, an expensive operation.
The function zeros creates an array full of zeros, the function ones creates an array full of ones,
and the function empty creates an array whose initial content is random and depends on the state
of the memory. By default, the dtype of the created array is float64.
>>> np.zeros( (3,4) )
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
>>> np.ones( (2,3,4), dtype=np.int16 )        # dtype can also be specified
array([[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]],
[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]]], dtype=int16)
>>> np.empty( (2,3) )                         # uninitialized, output may vary
array([[ 3.73603959e-262, 6.02658058e-154, 6.55490914e-260],
[ 5.30498948e-313, 3.14673309e-307, 1.00000000e+000]])
To create sequences of numbers, NumPy provides a function analogous to range that returns
arrays instead of lists.
>>> np.arange( 10, 30, 5 )
array([10, 15, 20, 25])
>>> np.arange( 0, 2, 0.3 ) # it accepts float arguments
array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])
When arange is used with floating point arguments, it is generally not possible to predict the
number of elements obtained, due to the finite floating point precision. For this reason, it is
usually better to use the function linspace that receives as an argument the number of elements
that we want, instead of the step:
>>> from numpy import pi
>>> np.linspace( 0, 2, 9 ) # 9 numbers from 0 to 2
array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. ])
>>> x = np.linspace( 0, 2*pi, 100 )        # useful to evaluate function at lots of points
>>> f = np.sin(x)
See also
array, zeros, zeros_like, ones, ones_like, empty, empty_like, arange, linspace,
numpy.random.rand, numpy.random.randn, fromfunction, fromfile
Printing Arrays
When you print an array, NumPy displays it in a similar way to nested lists, but with the
following layout:
• the last axis is printed from left to right,
• the second-to-last is printed from top to bottom,
• the rest are also printed from top to bottom, with each slice separated from the next by an
empty line.
One-dimensional arrays are then printed as rows, bidimensionals as matrices and tridimensionals
as lists of matrices.
>>> a = np.arange(6) # 1d array
>>> print(a)
[0 1 2 3 4 5]
>>>
>>> b = np.arange(12).reshape(4,3) # 2d array
>>> print(b)
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
>>>
>>> c = np.arange(24).reshape(2,3,4) # 3d array
>>> print(c)
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
See below to get more details on reshape.
If an array is too large to be printed, NumPy automatically skips the central part of the array and
only prints the corners:
>>> print(np.arange(10000))
[ 0 1 2 ..., 9997 9998 9999]
>>>
>>> print(np.arange(10000).reshape(100,100))
[[ 0 1 2 ..., 97 98 99]
[ 100 101 102 ..., 197 198 199]
[ 200 201 202 ..., 297 298 299]
...,
[9700 9701 9702 ..., 9797 9798 9799]
[9800 9801 9802 ..., 9897 9898 9899]
[9900 9901 9902 ..., 9997 9998 9999]]
To disable this behaviour and force NumPy to print the entire array, you can change the printing
options using set_printoptions.
>>> np.set_printoptions(threshold=np.nan)   # on newer NumPy versions, pass sys.maxsize instead of np.nan

Basic Operations
Arithmetic operators on arrays apply elementwise. A new array is created and filled with the
result.
>>> a = np.array( [20,30,40,50] )
>>> b = np.arange( 4 )
>>> b
array([0, 1, 2, 3])
>>> c = a-b
>>> c
array([20, 29, 38, 47])
>>> b**2
array([0, 1, 4, 9])
>>> 10*np.sin(a)
array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854])
>>> a<35
array([ True, True, False, False], dtype=bool)
Unlike in many matrix languages, the product operator * operates elementwise in NumPy arrays.
The matrix product can be performed using the dot function or method:
>>> A = np.array( [[1,1],
... [0,1]] )
>>> B = np.array( [[2,0],
... [3,4]] )
>>> A*B # elementwise product
array([[2, 0],
[0, 4]])
>>> A.dot(B) # matrix product
array([[5, 4],
[3, 4]])
>>> np.dot(A, B) # another matrix product
array([[5, 4],
[3, 4]])
Some operations, such as += and *=, act in place to modify an existing array rather than create a
new one.
>>> a = np.ones((2,3), dtype=int)
>>> b = np.random.random((2,3))
>>> a *= 3
>>> a
array([[3, 3, 3],
[3, 3, 3]])
>>> b += a
>>> b
array([[ 3.417022 , 3.72032449, 3.00011437],
[ 3.30233257, 3.14675589, 3.09233859]])
>>> a += b                  # b is not automatically converted to integer type
Traceback (most recent call last):
...
TypeError: Cannot cast ufunc add output from dtype('float64') to
dtype('int64') with casting rule 'same_kind'
When operating with arrays of different types, the type of the resulting array corresponds to the
more general or precise one (a behavior known as upcasting).
>>> a = np.ones(3, dtype=np.int32)
>>> b = np.linspace(0,pi,3)
>>> b.dtype.name
'float64'
>>> c = a+b
>>> c
array([ 1. , 2.57079633, 4.14159265])
>>> c.dtype.name
'float64'
>>> d = np.exp(c*1j)
>>> d
array([ 0.54030231+0.84147098j, -0.84147098+0.54030231j,
-0.54030231-0.84147098j])
>>> d.dtype.name
'complex128'
Many unary operations, such as computing the sum of all the elements in the array, are
implemented as methods of the ndarray class.
>>> a = np.random.random((2,3))
>>> a
array([[ 0.18626021, 0.34556073, 0.39676747],
[ 0.53881673, 0.41919451, 0.6852195 ]])
>>> a.sum()
2.5718191614547998
>>> a.min()
0.1862602113776709
>>> a.max()
0.6852195003967595
By default, these operations apply to the array as though it were a list of numbers, regardless of
its shape. However, by specifying the axis parameter you can apply an operation along the
specified axis of an array:
>>> b = np.arange(12).reshape(3,4)
>>> b
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>>
>>> b.sum(axis=0) # sum of each column
array([12, 15, 18, 21])
>>>
>>> b.min(axis=1) # min of each row
array([0, 4, 8])
>>>
>>> b.cumsum(axis=1) # cumulative sum along each row
array([[ 0, 1, 3, 6],
[ 4, 9, 15, 22],
[ 8, 17, 27, 38]])

Universal Functions
NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are
called “universal functions”(ufunc). Within NumPy, these functions operate elementwise on an
array, producing an array as output.
>>> B = np.arange(3)
>>> B
array([0, 1, 2])
>>> np.exp(B)
array([ 1. , 2.71828183, 7.3890561 ])
>>> np.sqrt(B)
array([ 0. , 1. , 1.41421356])
>>> C = np.array([2., -1., 4.])
>>> np.add(B, C)
array([ 2., 0., 6.])
See also
all, any, apply_along_axis, argmax, argmin, argsort, average, bincount, ceil, clip, conj, corrcoef,
cov, cross, cumprod, cumsum, diff, dot, floor, inner, inv, lexsort, max, maximum, mean, median,
min, minimum, nonzero, outer, prod, re, round, sort, std, sum, trace, transpose, var, vdot,
vectorize, where
Indexing, Slicing and Iterating
One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other
Python sequences.
>>> a = np.arange(10)**3
>>> a
array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729])
>>> a[2]
8
>>> a[2:5]
array([ 8, 27, 64])
>>> a[:6:2] = -1000    # equivalent to a[0:6:2] = -1000; from start to position 6, exclusive, set every 2nd element to -1000
>>> a
array([-1000, 1, -1000, 27, -1000, 125, 216, 343, 512, 729])
>>> a[ : :-1] # reversed a
array([ 729, 512, 343, 216, 125, -1000, 27, -1000, 1, -1000])
>>> for i in a:
... print(i**(1/3.))
...
nan
1.0
nan
3.0
nan
5.0
6.0
7.0
8.0
9.0
Multidimensional arrays can have one index per axis. These indices are given in a tuple
separated by commas:
>>> def f(x,y):
... return 10*x+y
...
>>> b = np.fromfunction(f,(5,4),dtype=int)
>>> b
array([[ 0, 1, 2, 3],
[10, 11, 12, 13],
[20, 21, 22, 23],
[30, 31, 32, 33],
[40, 41, 42, 43]])
>>> b[2,3]
23
>>> b[0:5, 1] # each row in the second column of b
array([ 1, 11, 21, 31, 41])
>>> b[ : ,1] # equivalent to the previous example
array([ 1, 11, 21, 31, 41])
>>> b[1:3, : ]                      # each column in the second and third row of b
array([[10, 11, 12, 13],
[20, 21, 22, 23]])
When fewer indices are provided than the number of axes, the missing indices are considered
complete slices:
>>> b[-1]                                  # the last row. Equivalent to b[-1,:]
array([40, 41, 42, 43])
The expression within brackets in b[i] is treated as an i followed by as many instances of : as
needed to represent the remaining axes. NumPy also allows you to write this using dots as b[i,...].
The dots (...) represent as many colons as needed to produce a complete indexing tuple. For
example, if x is a rank 5 array (i.e., it has 5 axes), then
• x[1,2,...] is equivalent to x[1,2,:,:,:],
• x[...,3] to x[:,:,:,:,3] and
• x[4,...,5,:] to x[4,:,:,5,:].
>>> c = np.array( [[[ 0, 1, 2],             # a 3D array (two stacked 2D arrays)
... [ 10, 12, 13]],
... [[100,101,102],
... [110,112,113]]])
>>> c.shape
(2, 2, 3)
>>> c[1,...] # same as c[1,:,:] or c[1]
array([[100, 101, 102],
[110, 112, 113]])
>>> c[...,2] # same as c[:,:,2]
array([[ 2, 13],
[102, 113]])
Iterating over multidimensional arrays is done with respect to the first axis:
>>> for row in b:
... print(row)
...
[0 1 2 3]
[10 11 12 13]
[20 21 22 23]
[30 31 32 33]
[40 41 42 43]
However, if one wants to perform an operation on each element in the array, one can use the flat
attribute which is an iterator over all the elements of the array:
>>> for element in b.flat:
... print(element)
...
0
1
2
3
10
11
12
13
20
21
22
23
30
31
32
33
40
41
42
43
See also
Indexing, Indexing (reference), newaxis, ndenumerate, indices
Shape Manipulation
Changing the shape of an array
An array has a shape given by the number of elements along each axis:
>>> a = np.floor(10*np.random.random((3,4)))
>>> a
array([[ 2., 8., 0., 6.],
[ 4., 5., 1., 1.],
[ 8., 9., 3., 6.]])
>>> a.shape
(3, 4)
The shape of an array can be changed with various commands. Note that the following three
commands all return a modified array, but do not change the original array:
>>> a.ravel() # returns the array, flattened
array([ 2., 8., 0., 6., 4., 5., 1., 1., 8., 9., 3., 6.])
>>> a.reshape(6,2) # returns the array with a modified shape
array([[ 2., 8.],
[ 0., 6.],
[ 4., 5.],
[ 1., 1.],
[ 8., 9.],
[ 3., 6.]])
>>> a.T # returns the array, transposed
array([[ 2., 4., 8.],
[ 8., 5., 9.],
[ 0., 1., 3.],
[ 6., 1., 6.]])
>>> a.T.shape
(4, 3)
>>> a.shape
(3, 4)
The order of the elements in the array resulting from ravel() is normally “C-style”, that is, the
rightmost index “changes the fastest”, so the element after a[0,0] is a[0,1]. If the array is
reshaped to some other shape, again the array is treated as “C-style”. NumPy normally creates
arrays stored in this order, so ravel() will usually not need to copy its argument, but if the array
was made by taking slices of another array or created with unusual options, it may need to be
copied. The functions ravel() and reshape() can also be instructed, using an optional argument, to
use FORTRAN-style arrays, in which the leftmost index changes the fastest.
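As an added illustration (using the same array a shown above), passing order='F' requests Fortran-style ordering:
>>> a.ravel(order='F')    # leftmost index changes fastest: column by column
array([ 2., 4., 8., 8., 5., 9., 0., 1., 3., 6., 1., 6.])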
The reshape function returns its argument with a modified shape, whereas the ndarray.resize
method modifies the array itself:
>>> a
array([[ 2., 8., 0., 6.],
[ 4., 5., 1., 1.],
[ 8., 9., 3., 6.]])
>>> a.resize((2,6))
>>> a
array([[ 2., 8., 0., 6., 4., 5.],
[ 1., 1., 8., 9., 3., 6.]])
If a dimension is given as -1 in a reshaping operation, the other dimensions are automatically
calculated:
>>> a.reshape(3,-1)
array([[ 2., 8., 0., 6.],
[ 4., 5., 1., 1.],
[ 8., 9., 3., 6.]])
See also
ndarray.shape, reshape, resize, ravel
Stacking together different arrays
Several arrays can be stacked together along different axes:
>>> a = np.floor(10*np.random.random((2,2)))
>>> a
array([[ 8., 8.],
[ 0., 0.]])
>>> b = np.floor(10*np.random.random((2,2)))
>>> b
array([[ 1., 8.],
[ 0., 4.]])
>>> np.vstack((a,b))
array([[ 8., 8.],
[ 0., 0.],
[ 1., 8.],
[ 0., 4.]])
>>> np.hstack((a,b))
array([[ 8., 8., 1., 8.],
[ 0., 0., 0., 4.]])
The function column_stack stacks 1D arrays as columns into a 2D array. It is equivalent to
hstack only for 2D arrays:
>>> from numpy import newaxis
>>> np.column_stack((a,b)) # with 2D arrays
array([[ 8., 8., 1., 8.],
[ 0., 0., 0., 4.]])
>>> a = np.array([4.,2.])
>>> b = np.array([3.,8.])
>>> np.column_stack((a,b)) # returns a 2D array
array([[ 4., 3.],
[ 2., 8.]])
>>> np.hstack((a,b)) # the result is different
array([ 4., 2., 3., 8.])
>>> a[:,newaxis] # this allows to have a 2D columns vector
array([[ 4.],
[ 2.]])
>>> np.column_stack((a[:,newaxis],b[:,newaxis]))
array([[ 4., 3.],
[ 2., 8.]])
>>> np.hstack((a[:,newaxis],b[:,newaxis])) # the result is the same
array([[ 4., 3.],
[ 2., 8.]])
On the other hand, the function row_stack is equivalent to vstack for any input arrays. In general,
for arrays with more than two dimensions, hstack stacks along their second axes, vstack stacks
along their first axes, and concatenate allows for an optional argument giving the number of the
axis along which the concatenation should happen.
Note
In complex cases, r_ and c_ are useful for creating arrays by stacking numbers along one axis.
They allow the use of range literals (":"):
>>> np.r_[1:4,0,4]
array([1, 2, 3, 0, 4])
When used with arrays as arguments, r_ and c_ are similar to vstack and hstack in their default
behavior, but allow for an optional argument giving the number of the axis along which to
concatenate.
See also
hstack, vstack, column_stack, concatenate, c_, r_
Splitting one array into several smaller ones
Using hsplit, you can split an array along its horizontal axis, either by specifying the number of
equally shaped arrays to return, or by specifying the columns after which the division should
occur:
>>> a = np.floor(10*np.random.random((2,12)))
>>> a
array([[ 9., 5., 6., 3., 6., 8., 0., 7., 9., 7., 2., 7.],
[ 1., 4., 9., 2., 2., 1., 0., 6., 2., 2., 4., 0.]])
>>> np.hsplit(a,3) # Split a into 3
[array([[ 9., 5., 6., 3.],
[ 1., 4., 9., 2.]]), array([[ 6., 8., 0., 7.],
[ 2., 1., 0., 6.]]), array([[ 9., 7., 2., 7.],
[ 2., 2., 4., 0.]])]
>>> np.hsplit(a,(3,4)) # Split a after the third and the fourth column
[array([[ 9., 5., 6.],
[ 1., 4., 9.]]), array([[ 3.],
[ 2.]]), array([[ 6., 8., 0., 7., 9., 7., 2., 7.],
[ 2., 1., 0., 6., 2., 2., 4., 0.]])]
vsplit splits along the vertical axis, and array_split allows one to specify along which axis to
split.
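As an added, illustrative example of these two functions:
>>> a = np.arange(12).reshape(4,3)
>>> np.vsplit(a,2)                   # split along the vertical (first) axis into 2 equal parts
[array([[0, 1, 2],
       [3, 4, 5]]), array([[ 6,  7,  8],
       [ 9, 10, 11]])]
>>> np.array_split(np.arange(8),3)   # array_split also allows unequal splits
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]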

Python SciPy Tutorial: Learn with Example


What is SciPy?
SciPy is an open-source Python-based library, which is used in mathematics, scientific
computing, engineering, and technical computing.
SciPy is pronounced "Sigh Pie."
Sub-packages of SciPy:
• File input/output - scipy.io
• Special Function - scipy.special
• Linear Algebra Operation - scipy.linalg
• Interpolation - scipy.interpolate
• Optimization and fit - scipy.optimize
• Statistics and random numbers - scipy.stats
• Numerical Integration - scipy.integrate
• Fast Fourier transforms - scipy.fftpack
• Signal Processing - scipy.signal
• Image manipulation – scipy.ndimage
In this tutorial, you will learn:
• What is SciPy?
• Why use SciPy
• Numpy VS SciPy
• SciPy - Installation and Environment Setup
• File Input / Output package:
• Special Function package:
• Linear Algebra with SciPy:
• Discrete Fourier Transform – scipy.fftpack
• Optimization and Fit in SciPy – scipy.optimize
• Nelder –Mead Algorithm:
• Image Processing with SciPy – scipy.ndimage

Why use SciPy


• SciPy contains a variety of subpackages which help solve the most common issues related
to scientific computation.
• SciPy is one of the most widely used scientific libraries, second only to the GNU Scientific
Library for C/C++ and MATLAB's toolboxes.
• It is easy to use and understand, and offers fast computational power.
• It can operate on arrays from the NumPy library.

Numpy VS SciPy
Numpy:
• NumPy is written in C and is used for mathematical or numeric calculation.
• It is faster than many other Python libraries.
• NumPy is one of the most useful libraries in Data Science for performing basic calculations.
• NumPy provides the array data type, which supports the most basic operations like
sorting, shaping, indexing, etc.
SciPy:
• SciPy is built on top of NumPy.
• SciPy offers a fully-featured linear algebra module, while NumPy contains only a few
such features.
• Most new Data Science features are available in SciPy rather than NumPy.

SciPy - Installation and Environment Setup


You can install SciPy on Windows via pip:
python3 -m pip install --user numpy scipy
Install SciPy on Linux:
sudo apt-get install python-scipy python-numpy
Install SciPy on Mac:
sudo port install py35-scipy py35-numpy
Before you start learning SciPy, you need to know the basic functionality as well as the different
types of NumPy arrays.
The standard way of importing SciPy modules and NumPy:
from scipy import special #same for other modules
import numpy as np

File Input / Output package:


The SciPy I/O package has a wide range of functions for working with different file formats:
MATLAB, Arff, Wave, Matrix Market, IDL, NetCDF, TXT, CSV and binary formats.
Let's take the MATLAB file format, which is regularly used, as an example:
import numpy as np
from scipy import io as sio
array = np.ones((4, 4))
sio.savemat('example.mat', {'ar': array})
data = sio.loadmat('example.mat', struct_as_record=True)
data['ar']
Output:
array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
Code Explanation
• Line 1 & 2: Import the essential libraries: the scipy i/o package and NumPy.
• Line 3: Create a 4 x 4 array of ones.
• Line 4: Store the array in the example.mat file.
• Line 5: Load the data back from the example.mat file.
• Line 6: Access the stored array under the key 'ar'.

Special Function package


• scipy.special package contains numerous functions of mathematical physics.
• SciPy special functions include Cubic Root, Exponential, Log Sum Exponential, Lambert,
Permutations and Combinations, Gamma, Bessel, hypergeometric, Kelvin, beta, parabolic
cylinder, Relative Error Exponential, etc.
• For a one-line description of all of these functions, type in the Python console:
import scipy.special
help(scipy.special)
Output :
NAME
scipy.special

DESCRIPTION
========================================
Special functions (:mod:`scipy.special`)
========================================

.. module:: scipy.special

Nearly all of the functions below are universal functions and follow
broadcasting and automatic array-looping rules. Exceptions are noted.

Cubic Root Function:


Cubic Root function finds the cube root of values.
Syntax:
scipy.special.cbrt(x)
Example:
from scipy.special import cbrt
#Find cubic root of 27 & 64 using cbrt() function
cb = cbrt([27, 64])
#print value of cb
print(cb)
Output: [3. 4.]
Exponential Function:
The exp10 function computes 10**x element-wise.
Example:
from scipy.special import exp10
#pass values to the exp10 function
exp = exp10([1,10])
print(exp)
Output: [1.e+01 1.e+10]
Permutations & Combinations:
SciPy also gives functionality to calculate Permutations and Combinations.
Combinations - scipy.special.comb(N,k)
Example:
from scipy.special import comb
#find combinations of 5, 2 values using comb(N, k)
com = comb(5, 2, exact = False, repetition=True)
print(com)
Output: 15.0
Permutations –
scipy.special.perm(N,k)
Example:
from scipy.special import perm
#find permutation of 5, 2 using perm (N, k) function
per = perm(5, 2, exact = True)
print(per)
Output: 20
Log Sum Exponential Function
logsumexp computes the log of the sum of exponentials of the input elements.
Syntax :
scipy.special.logsumexp(x)
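Example (an added, illustrative snippet; it simply compares logsumexp with the naive formula):
from scipy.special import logsumexp
import numpy as np
a = np.array([1, 2, 3])
#logsumexp is a numerically stable way of computing log(sum(exp(a)))
print(logsumexp(a))
print(np.log(np.sum(np.exp(a))))
Output: both lines print approximately 3.4076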

Bessel Function
jn computes the Bessel function of the first kind of integer order n.
Syntax :
scipy.special.jn()
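Example (an added, illustrative snippet):
from scipy.special import jn
#Bessel function of the first kind of order 2, evaluated at x = 1.5
print(jn(2, 1.5))
Output: approximately 0.2321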

Linear Algebra with SciPy


• SciPy's linear algebra module is built on the BLAS and ATLAS/LAPACK libraries, so its
performance is very fast.
• Linear algebra routines accept two-dimensional array objects, and the output is also a two-
dimensional array.
Now let's do some tests with scipy.linalg.
Calculating the determinant of a two-dimensional matrix:
from scipy import linalg
import numpy as np
#define square matrix
two_d_array = np.array([ [4,5], [3,2] ])
#pass values to det() function
linalg.det( two_d_array )
Output: -7.0
Inverse Matrix –
scipy.linalg.inv()
scipy.linalg.inv() calculates the inverse of any (non-singular) square matrix.
Let's see:
from scipy import linalg
import numpy as np
# define square matrix
two_d_array = np.array([ [4,5], [3,2] ])
#pass value to function inv()
linalg.inv( two_d_array )
Output:
array( [[-0.28571429, 0.71428571],
[ 0.42857143, -0.57142857]] )
Eigenvalues and Eigenvector – scipy.linalg.eig()
• One of the most common problems in linear algebra is finding eigenvalues and eigenvectors,
which can be easily solved using the eig() function.
• Now let's find the eigenvalues and the corresponding eigenvectors of a two-dimensional
square matrix.
Example:
from scipy import linalg
import numpy as np
#define two dimensional array
arr = np.array([[5,4],[6,3]])
#pass value into function
eg_val, eg_vect = linalg.eig(arr)
#get eigenvalues
print(eg_val)
#get eigenvectors
print(eg_vect)
Output:
[ 9.+0.j -1.+0.j] #eigenvalues
[ [ 0.70710678 -0.5547002 ] #eigenvectors
[ 0.70710678 0.83205029] ]

Discrete Fourier Transform – scipy.fftpack


• The DFT (Discrete Fourier Transform) is a mathematical technique used to convert temporal
or spatial data into frequency data.
• FFT (Fast Fourier Transform) is an algorithm for computing the DFT.
• FFT can be applied to multidimensional arrays.
• Frequency is the number of signal cycles (wavelengths) in a particular time period.
Example: take a wave and display it using the Matplotlib library. We take the simple periodic
function sin(5 × 2πt):
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

#Frequency in terms of Hertz


fre = 5
#Sample rate
fre_samp = 50
t = np.linspace(0, 2, 2 * fre_samp, endpoint = False )
a = np.sin(fre * 2 * np.pi * t)
figure, axis = plt.subplots()
axis.plot(t, a)
axis.set_xlabel ('Time (s)')
axis.set_ylabel ('Signal amplitude')
plt.show()
Output: a plot of the 5 Hz sine wave.
The frequency is 5 Hz and the signal repeats every 1/5 of a second; this interval is called the
time period.
Now let us apply the DFT to this sinusoidal wave.
from scipy import fftpack

A = fftpack.fft(a)
frequency = fftpack.fftfreq(len(a)) * fre_samp
figure, axis = plt.subplots()

axis.stem(frequency, np.abs(A))
axis.set_xlabel('Frequency in Hz')
axis.set_ylabel('Frequency Spectrum Magnitude')
axis.set_xlim(-fre_samp / 2, fre_samp/ 2)
axis.set_ylim(-5, 110)
plt.show()
Output: a stem plot of the frequency spectrum.
• You can clearly see that the output is a one-dimensional array.
• It contains complex values which are (nearly) zero except at two points, the ±5 Hz components.
• In this DFT example we visualize the magnitude of the signal.

Optimization and Fit in SciPy – scipy.optimize


• scipy.optimize provides useful algorithms for curve fitting, multidimensional and scalar
function minimization, and root finding.
• Let's take an example of a scalar function and find its minimum.
%matplotlib inline
import matplotlib.pyplot as plt
from scipy import optimize
import numpy as np

def function(a):
    return a*2 + 20 * np.sin(a)

#sample points at which to plot the function
a = np.arange(-10, 10, 0.1)
plt.plot(a, function(a))
plt.show()
#use BFGS algorithm for optimization
optimize.fmin_bfgs(function, 0)
Output:
Optimization terminated successfully.
Current function value: -23.241676
Iterations: 4
Function evaluations: 18
Gradient evaluations: 6
array([-1.67096375])
• In this example, optimization is done with the help of the gradient-based BFGS algorithm,
starting from the initial point 0.
• But a possible issue is ending up in a local minimum instead of the global minimum. If we
cannot start from a neighborhood of the global minimum, we need to apply a global
optimization method such as basinhopping(), which combines random jumps with a local
optimizer.
optimize.basinhopping(function, 0)
Output:
fun: -23.241676238045315
lowest_optimization_result:
fun: -23.241676238045315
hess_inv: array([[0.05023331]])
jac: array([4.76837158e-07])
message: 'Optimization terminated successfully.'
nfev: 15
nit: 3
njev: 5
status: 0
success: True
x: array([-1.67096375])
message: ['requested number of basinhopping iterations
completed successfully']
minimization_failures: 0
nfev: 1530
nit: 100
njev: 510
x: array([-1.67096375])

Nelder –Mead Algorithm:


• The Nelder-Mead algorithm is selected through the method parameter.
• It provides the most straightforward way of minimizing a fairly well-behaved function.
• The Nelder-Mead algorithm does not use gradient evaluations, which is why it may take
longer to find the solution.
import numpy as np
from scipy.optimize import minimize
#define function f(x)
def f(x):
    return .4*(1 - x[0])**2

minimize(f, [2, -1], method="Nelder-Mead")


Output:
final_simplex: (array([[ 1. , -1.27109375],
[ 1. , -1.27118835],
[ 1. , -1.27113762]]), array([0., 0., 0.]))
fun: 0.0
message: 'Optimization terminated successfully.'
nfev: 147
nit: 69
status: 0
success: True
x: array([ 1. , -1.27109375])

Image Processing with SciPy – scipy.ndimage


• scipy.ndimage is a submodule of SciPy which is mostly used for performing image-related
operations.
• ndimage stands for "n-dimensional image."
• SciPy image processing provides geometric transformations (rotate, crop, flip), image
filtering (sharpening and denoising), image display, image segmentation, classification and
feature extraction.
• The misc package in SciPy contains prebuilt images which can be used to perform image
manipulation tasks.
Example: let's take a geometric transformation example with images:
from scipy import misc
from matplotlib import pyplot as plt
import numpy as np
#get face image of panda from misc package
panda = misc.face()
#plot or show image of face
plt.imshow( panda )
plt.show()
Output:

Now we flip the current image upside down:


#Flip Down using scipy misc.face image
flip_down = np.flipud(misc.face())
plt.imshow(flip_down)
plt.show()
Output:
Example: Rotation of Image using Scipy,
from scipy import ndimage, misc
from matplotlib import pyplot as plt
panda = misc.face()
#rotation function of scipy for images – image rotated by 135 degrees
panda_rotate = ndimage.rotate(panda, 135)
plt.imshow(panda_rotate)
plt.show()
Output:
Integration with Scipy – Numerical Integration
• When we need to integrate a function that cannot be integrated analytically, we turn to
numerical integration.
• SciPy provides functionality to integrate functions numerically.
• The scipy.integrate library has single, double, triple and multiple integration, Gaussian
quadrature, Romberg, trapezoidal and Simpson's rules.
Example: now take an example of a single integration of f(x) = x**2 from a to b.
Here a is the lower limit and b is the upper limit.


from scipy import integrate
# take f(x) function as f
f = lambda x : x**2
#single integration with a = 0 & b = 1
integration = integrate.quad(f, 0 , 1)
print(integration)
Output:
(0.33333333333333337, 3.700743415417189e-15)
Here the function returns two values: the first is the value of the integral and the second is the
estimated error in the integral.
Example: now take an example of double integration. We evaluate the double integral of
64*x*y, where the outer variable x runs from 0 to 1/2 and the inner variable y runs from 0 to
sqrt(1 - 2*x**2):
from scipy import integrate


import numpy as np
#import square root function from math lib
from math import sqrt
# set function f(x, y)
f = lambda x, y : 64 *x*y
# lower limit of the inner integral
p = lambda x : 0
# upper limit of the inner integral
q = lambda y : sqrt(1 - 2*y**2)
# perform double integration
integration = integrate.dblquad(f , 0 , 2/4, p, q)
print(integration)
Output:
(3.0, 9.657432734515774e-14)
As before, the function returns the value of the integral along with the estimated error.

Summary
• SciPy (pronounced "Sigh Pie") is an open-source Python-based library, which is used in
mathematics, scientific computing, engineering, and technical computing.
• SciPy contains a variety of subpackages which help solve the most common issues related
to scientific computation.
• SciPy is built on top of NumPy.
Package Name        Description
scipy.io            File input/output
scipy.special       Special functions
scipy.linalg        Linear algebra operations
scipy.interpolate   Interpolation
scipy.optimize      Optimization and fit
scipy.stats         Statistics and random numbers
scipy.integrate     Numerical integration
scipy.fftpack       Fast Fourier transforms
scipy.signal        Signal processing
scipy.ndimage       Image manipulation

An introduction to machine learning with scikit-learn
Section contents
In this section, we introduce the machine learning vocabulary that we use throughout scikit-learn
and give a simple learning example.

Machine learning: the problem setting


In general, a learning problem considers a set of n samples of data and then tries to predict
properties of unknown data. If each sample is more than a single number and, for instance, a
multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
• supervised learning, in which the data comes with additional attributes that we
want to predict (Click here to go to the scikit-learn supervised learning page). This
problem can be either:
• classification: samples belong to two or more classes and we want
to learn from already labeled data how to predict the class of
unlabeled data. An example of a classification problem would be
handwritten digit recognition, in which the aim is to assign each
input vector to one of a finite number of discrete categories.
Another way to think of classification is as a discrete (as opposed
to continuous) form of supervised learning where one has a limited
number of categories and for each of the n samples provided, one
is to try to label them with the correct category or class.
• regression: if the desired output consists of one or more continuous
variables, then the task is called regression. An example of a
regression problem would be the prediction of the length of a
salmon as a function of its age and weight.
• unsupervised learning, in which the training data consists of a set of input vectors
x without any corresponding target values. The goal in such problems may be to
discover groups of similar examples within the data, where it is called clustering,
or to determine the distribution of data within the input space, known as density
estimation, or to project the data from a high-dimensional space down to two or
three dimensions for the purpose of visualization (Click here to go to the Scikit-
Learn unsupervised learning page).
Training set and testing set
Machine learning is about learning some properties of a data set and then testing those properties
against another data set. A common practice in machine learning is to evaluate an algorithm by
splitting a data set into two. We call one of those sets the training set, on which we learn some
properties; we call the other set the testing set, on which we test the learned properties.
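A minimal, illustrative sketch of such a split using scikit-learn's train_test_split helper (the 25% test size and random_state below are arbitrary choices, not part of the original text):
>>> from sklearn.model_selection import train_test_split
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.25, random_state=0)
>>> X_train.shape, X_test.shape
((112, 4), (38, 4))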

Loading an example dataset


scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for
classification and the boston house prices dataset for regression.
In the following, we start a Python interpreter from our shell and then load the iris and digits
datasets. Our notational convention is that $ denotes the shell prompt while >>> denotes the
Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data.
This data is stored in the .data member, which is an (n_samples, n_features) array. In the case
of a supervised problem, one or more response variables are stored in the .target member. More
details on the different datasets can be found in the dedicated section.
For instance, in the case of the digits dataset, digits.data gives access to the features that can
be used to classify the digits samples:
>>> print(digits.data)
[[ 0. 0. 5. ... 0. 0. 0.]
[ 0. 0. 0. ... 10. 0. 0.]
[ 0. 0. 0. ... 16. 9. 0.]
...
[ 0. 0. 1. ... 6. 0. 0.]
[ 0. 0. 2. ... 12. 0. 0.]
[ 0. 0. 10. ... 12. 1. 0.]]
and digits.target gives the ground truth for the digit dataset, that is the number corresponding
to each digit image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
Shape of the data arrays
The data is always a 2D array, shape (n_samples, n_features), although the original data
may have had a different shape. In the case of the digits, each original sample is an image of
shape (8, 8) and can be accessed using:
>>> digits.images[0]
array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
[ 0., 0., 13., 15., 10., 15., 5., 0.],
[ 0., 3., 15., 2., 0., 11., 8., 0.],
[ 0., 4., 12., 0., 0., 8., 8., 0.],
[ 0., 5., 8., 0., 0., 9., 8., 0.],
[ 0., 4., 11., 0., 1., 12., 7., 0.],
[ 0., 2., 14., 5., 10., 12., 0., 0.],
[ 0., 0., 6., 13., 10., 0., 0., 0.]])
The simple example on this dataset illustrates how starting from the original problem one can
shape the data for consumption in scikit-learn.
Loading from external datasets
To load from an external dataset, please refer to loading external datasets.

Learning and predicting


In the case of the digits dataset, the task is to predict, given an image, which digit it represents.
We are given samples of each of the 10 possible classes (the digits zero through nine) on which
we fit an estimator to be able to predict the classes to which unseen samples belong.
In scikit-learn, an estimator for classification is a Python object that implements the methods
fit(X, y) and predict(T).
An example of an estimator is the class sklearn.svm.SVC, which implements support vector
classification. The estimator’s constructor takes as arguments the model’s parameters.
For now, we will consider the estimator as a black box:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
Choosing the parameters of the model
In this example, we set the value of gamma manually. To find good values for these parameters,
we can use tools such as grid search and cross validation.
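As a hedged sketch of what such a search might look like with scikit-learn's GridSearchCV (the parameter grid below is purely illustrative and not part of the original text):
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn import svm
>>> param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1., 10., 100.]}
>>> search = GridSearchCV(svm.SVC(), param_grid, cv=5)
>>> search.fit(digits.data, digits.target)    # exhaustively tries every parameter combination
>>> search.best_params_                       # the best combination found on this data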
The clf (for classifier) estimator instance is first fitted to the model; that is, it must learn from
the model. This is done by passing our training set to the fit method. For the training set, we’ll
use all the images from our dataset, except for the last image, which we’ll reserve for our
predicting. We select the training set with the [:-1] Python syntax, which produces a new array
that contains all but the last item from digits.data:
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Now you can predict new values. In this case, you’ll predict using the last image from
digits.data. By predicting, you’ll determine the image from the training set that best matches
the last image.
>>> clf.predict(digits.data[-1:])
array([8])
The corresponding image is:

As you can see, it is a challenging task: after all, the images are of poor resolution. Do you agree
with the classifier?
A complete example of this classification problem is available as an example that you can run
and study: Recognizing hand-written digits.

Model persistence
It is possible to save a model in scikit-learn by using Python’s built-in persistence model, pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC(gamma='scale')
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

>>> import pickle


>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
In the specific case of scikit-learn, it may be more interesting to use joblib’s replacement for
pickle (joblib.dump & joblib.load), which is more efficient on big data but it can only pickle
to the disk and not to a string:
>>> from joblib import dump, load
>>> dump(clf, 'filename.joblib')
Later, you can reload the pickled model (possibly in another Python process) with:
>>> clf = load('filename.joblib')
Note
joblib.dump and joblib.load functions also accept file-like object instead of filenames. More
information on data persistence with Joblib is available here.
Note that pickle has some security and maintainability issues. Please refer to section Model
persistence for more detailed information about model persistence with scikit-learn.

Conventions
scikit-learn estimators follow certain rules to make their behavior more predictive. These are
described in more detail in the Glossary of Common Terms and API Elements.
Type casting
Unless otherwise specified, input will be cast to float64:
>>> import numpy as np
>>> from sklearn import random_projection

>>> rng = np.random.RandomState(0)


>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')

>>> transformer = random_projection.GaussianRandomProjection()


>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
In this example, X is float32, which is cast to float64 by fit_transform(X).
Regression targets are cast to float64 and classification targets are maintained:
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC(gamma='scale')
>>> clf.fit(iris.data, iris.target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]

>>> clf.fit(iris.data, iris.target_names[iris.target])


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
Here, the first predict() returns an integer array, since iris.target (an integer array) was
used in fit. The second predict() returns a string array, since iris.target_names was for
fitting.
Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the
set_params() method. Calling fit() more than once will overwrite what was learned by any
previous fit():
>>> import numpy as np
>>> from sklearn.svm import SVC

>>> rng = np.random.RandomState(0)


>>> X = rng.rand(100, 10)
>>> y = rng.binomial(1, 0.5, 100)
>>> X_test = rng.rand(5, 10)

>>> clf = SVC()


>>> clf.set_params(kernel='linear').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([1, 0, 1, 1, 0])

>>> clf.set_params(kernel='rbf', gamma='scale').fit(X, y)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([0, 0, 0, 1, 0])
Here, the default kernel rbf is first changed to linear via SVC.set_params() after the
estimator has been constructed, and changed back to rbf to refit the estimator and to make a
second prediction.
Multiclass vs. multilabel fitting
When using multiclass classifiers, the learning and prediction task that is performed is
dependent on the format of the target data fit upon:
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.preprocessing import LabelBinarizer

>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]

>>> classif = OneVsRestClassifier(estimator=SVC(gamma='scale',


... random_state=0))
>>> classif.fit(X, y).predict(X)
array([0, 0, 1, 1, 2])
In the above case, the classifier is fit on a 1d array of multiclass labels and the predict()
method therefore provides corresponding multiclass predictions. It is also possible to fit upon a
2d array of binary label indicators:
>>> y = LabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 0]])
Here, the classifier is fit() on a 2d binary label representation of y, using the LabelBinarizer.
In this case predict() returns a 2d array representing the corresponding multilabel predictions.
Note that the fourth and fifth instances returned all zeroes, indicating that they matched none of
the three labels fit upon. With multilabel outputs, it is similarly possible for an instance to be
assigned multiple labels:
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
>>> y = MultiLabelBinarizer().fit_transform(y)
>>> classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
In this case, the classifier is fit upon instances each assigned multiple labels. The
MultiLabelBinarizer is used to binarize the 2d array of multilabels to fit upon. As a result,
predict() returns a 2d array with multiple predicted labels for each instance.
What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of AI that helps computers to understand,
interpret and manipulate human language.
NLP helps developers to organize and structure knowledge to perform tasks like translation,
summarization, named entity recognition, relationship extraction, speech recognition, topic
segmentation, etc.
NLP is a way of computers to analyze, understand and derive meaning from a human language
such as English, Spanish, Hindi, etc.
In this tutorial, you will learn:
• What is Natural Language Processing?
• History of NLP
• How does NLP work?
• Components of NLP
• NLP and writing systems
• How to implement NLP
• NLP Examples
• Future of NLP
• Natural language vs. Computer Language
• Advantages of NLP
• Disadvantages of NLP

History of NLP
Here are important events in the history of Natural Language Processing:
1950- NLP started when Alan Turing published an article called "Computing Machinery and
Intelligence."
1950- Attempts to automate translation between Russian and English
1960- The work of Chomsky and others on formal language theory and generative syntax
1990- Probabilistic and data-driven models had become quite standard
2000- A Large amount of spoken and textual data become available

How does NLP work?


Before we learn how NLP works, let's understand how humans use language.
Every day, we say thousands of words that other people interpret to do countless things. We
consider it simple communication, but we all know that words run much deeper than that.
There is always some context that we derive from what we say and how we say it. NLP never
focuses on voice modulation; instead, it draws on contextual patterns.
Example:
Man is to woman as king is to __________?
Meaning (king) – meaning (man) + meaning ( woman)=?
The answer is- queen
Here, we can easily correlate the two, because man is the male gender and woman is the female
gender. In the same way, king is the masculine gender, and its feminine counterpart is queen.
Example:
Is King to kings as the queen is to_______?
The answer is--- queens
Here, we can see the two words king and kings, where one is singular and the other is plural.
Therefore, when the word queen comes, it automatically correlates with queens, again singular
and plural.
Here, the biggest question is: how do we know what words mean? Say, who do we call a queen?

The answer is that we learn these things through experience. However, the main question here is:
how does a computer learn the same?
We need to provide enough data for machines to learn through experience. We can feed details
like:
• Her Majesty the Queen.
• The Queen's speech during the State visit
• The crown of Queen Elizabeth
• The Queen's Mother
• The queen is generous.
With the above examples, the machine understands the entity Queen.
The machine creates word vectors as below. A word vector is built using surrounding words.

The machine creates these vectors


• As it learns from multiple datasets
• By using machine learning (e.g., deep learning algorithms)
• A word vector is built using surrounding words.
Here is the formula:
Meaning (king) – meaning (man) + meaning (woman)=?
This amounts to performing simple algebraic operations on word vectors:
Vector ( king) – vector (man) + vector (woman)= vector(?)
To which the machine answers queen.
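A hedged sketch of this analogy with pretrained word vectors is shown below, assuming the gensim library and a word2vec-format model file (the file name is an assumption; any pretrained word2vec model works):
from gensim.models import KeyedVectors

# load a pretrained word2vec model (hypothetical file name, adjust to your own model)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# vector(king) - vector(man) + vector(woman) ~ vector(queen)
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# expected to print something like [('queen', ...)]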

Components of NLP
Five main Component of Natural Language processing are:
• Morphological and Lexical Analysis
• Syntactic Analysis
• Semantic Analysis
• Discourse Integration
• Pragmatic Analysis
Morphological and Lexical Analysis
Lexical analysis deals with a language's vocabulary, its words and expressions. It involves
analyzing, identifying and describing the structure of words. It includes dividing a text into
paragraphs, words and sentences.
Individual words are analyzed into their components, and non-word tokens such as punctuation
marks are separated from the words.
Semantic Analysis
Semantic analysis assigns meanings to the structures created by the syntactic analyzer. This
component maps linear sequences of words onto structures and shows how the words are
associated with each other.
Semantics focuses only on the literal meaning of words, phrases, and sentences. It only abstracts
the dictionary meaning, or the real meaning, from the given context. The structures assigned by
the syntactic analyzer do not always have a meaning.
E.g., "colorless green idea" would be rejected by the semantic analysis, as "colorless green"
doesn't make any sense.
Pragmatic Analysis
Pragmatic analysis deals with the overall communicative and social content and its effect on
interpretation. It means abstracting or deriving the meaningful use of language in situations. In
this analysis, the main focus is always on what was said being reinterpreted as what is actually
meant.
Pragmatic analysis helps users to discover this intended effect by applying a set of rules that
characterize cooperative dialogues.
E.g., "close the window?" should be interpreted as a request instead of an order.
Syntax analysis
Words are commonly accepted as being the smallest units of syntax. Syntax refers to the
principles and rules that govern the sentence structure of any individual language.
Syntax focuses on the proper ordering of words, which can affect a sentence's meaning. This
involves analyzing the words in a sentence according to the grammatical structure of the
sentence. The words are transformed into a structure that shows how the words are related to
each other.
Discourse Integration
It means making sense of the context. The meaning of any single sentence depends upon the
sentences that precede it, and it may also influence the meaning of the sentence that follows.
For example, the word "that" in the sentence "He wanted that" depends upon the prior discourse
context.

NLP and writing systems


The kind of writing system used for a language is one of the deciding factors in determining the
best approach for text pre-processing. Writing systems can be
• Logographic: a Large number of individual symbols represent words. Example Japanese,
Mandarin
• Syllabic: Individual symbols represent syllables
• Alphabetic: Individual symbols represent sound
The majority of writing systems use the syllabic or alphabetic system. Even English, with its
relatively simple writing system based on the Roman alphabet, utilizes logographic symbols
which include Arabic numerals, currency symbols ($, £), and other special symbols.
This poses the following challenges:
• Extracting meaning(semantics) from a text is a challenge
• NLP is dependent on the quality of the corpus. If the domain is vast, it's difficult to
understand context.
• There is a dependence on the character set and language

How to implement NLP


Given below are popular methods used for Natural Language Processing:
Machine learning: The learning procedures used during machine learning automatically focus on
the most common cases, whereas rules written by hand are often incomplete and prone to human
error.
Statistical inference: NLP can make use of statistical inference algorithms. It helps you to
produce models that are robust, e.g., to words or structures that have not been seen before.

NLP Examples
Today, Natural Language Processing technology is widely used.
Here are common applications of NLP:
Information retrieval & Web Search
Google, Yahoo, Bing, and other search engines base their machine translation technology on
NLP deep learning models. It allows algorithms to read text on a webpage, interpret its meaning
and translate it to another language.
Grammar Correction:
NLP technique is widely used by word processor software like MS-word for spelling correction
& grammar checking.
Question Answering
Type in keywords to ask Questions in Natural Language.
Text Summarization
The process of summarising important information from a source to produce a shortened version
Machine Translation
Use of computer applications to translate text or speech from one natural language to another.

Sentiment analysis
NLP helps companies to analyze a large number of reviews on a product. It also allows their
customers to give a review of the particular product.
Future of NLP
• Human-readable natural language processing is the biggest AI problem. It is almost the same
as solving the central artificial intelligence problem and making computers as intelligent as
people.
• Future computers or machines, with the help of NLP, will be able to learn from information
online and apply it in the real world; however, lots of work needs to be done in this regard.
• Combined with natural language generation, computers will become more capable of
receiving and giving useful and resourceful information or data.

Natural language vs. Computer Language


Parameter     Natural Language                               Computer Languages
Ambiguity     They are ambiguous in nature.                  They are designed to be unambiguous.
Redundancy    Natural languages employ lots of redundancy.   Formal languages are less redundant.
Literalness   Natural languages are made of idiom and        Formal languages mean exactly what
              metaphor.                                      they want to say.

Advantages of NLP
• Users can ask questions about any subject and get a direct response within seconds.
• NLP system provides answers to the questions in natural language
• NLP system offers exact answers to the questions, no unnecessary or unwanted
information
• The accuracy of the answers increases with the amount of relevant information provided
in the question.
• NLP process helps computers communicate with humans in their language and scales
other language-related tasks
• Allows you to process more language-based data than a human being, without fatigue and in
an unbiased and consistent way.
• Structuring a highly unstructured data source

Disadvantages of NLP
• Complex query language: the system may not be able to provide the correct answer to a
question that is poorly worded or ambiguous.
• The system is built for a single, specific task only; it is unable to adapt to new domains and
problems because of its limited functions.
• An NLP system may lack a user interface with features that allow users to further interact
with the system.

Summary
• Natural Language Processing is a branch of AI that helps computers understand, interpret and manipulate human language.
• NLP started when Alan Turing published the article "Computing Machinery and Intelligence".
• NLP does not focus on voice modulation; it draws on contextual patterns.
• Five essential components of Natural Language Processing are 1) Morphological and Lexical Analysis 2) Syntactic Analysis 3) Semantic Analysis 4) Discourse Integration 5) Pragmatic Analysis.
• Three types of natural language writing systems are 1) Logographic 2) Syllabic 3) Alphabetic.
• Machine learning and statistical inference are two methods used to implement Natural Language Processing.
• Essential applications of NLP are information retrieval & web search, grammar correction, question answering, text summarization, machine translation, etc.
• With the help of NLP, future computers or machines will be able to learn from information online and apply it in the real world; however, a lot of work still needs to be done in this regard.
• Natural language is ambiguous, while computer languages are designed to be unambiguous.
• The biggest advantage of an NLP system is that it offers exact answers to questions, with no unnecessary or unwanted information.
• The biggest drawback of an NLP system is that it is built for a single, specific task only, so it is unable to adapt to new domains and problems because of its limited functions.

Tokenize Words and Sentences with NLTK


What is Tokenization?
Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.
Natural language processing is used for building applications such as text classification, intelligent chatbots, sentiment analysis, language translation, etc. It is vital to understand the patterns in the text in order to achieve the above-stated purposes. Tokens are very useful for finding such patterns, and tokenization is considered a base step for stemming and lemmatization.
For the time being, don't worry about stemming and lemmatization; treat them as steps for textual data cleaning using NLP (natural language processing). We will discuss stemming and lemmatization later in the tutorial. Tasks such as text classification or spam filtering make use of NLP along with deep learning libraries such as Keras and TensorFlow.
The Natural Language Toolkit has a very important module, tokenize, which further comprises two sub-modules:
• word tokenize
• sentence tokenize
Tokenization of words
We use the method word_tokenize() to split a sentence into words. The output of word tokenization can be converted to a DataFrame for better text understanding in machine learning applications. It can also be provided as input for further text-cleaning steps such as punctuation removal, numeric character removal or stemming. Machine learning models need numeric data to be trained on and to make predictions; word tokenization is therefore a crucial part of converting text (strings) to numeric data. Please read about Bag of Words or CountVectorizer. Please refer to the example below to understand the theory better.
from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

Output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']

Code Explanation
• The word_tokenize module is imported from the NLTK library.
• A variable "text" is initialized with two sentences.
• The text variable is passed to word_tokenize and the result is printed. The module splits out punctuation as separate tokens, which you can see in the output.
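The paragraph above mentions Bag of Words and CountVectorizer. The following is a minimal, hedged sketch (assuming scikit-learn and pandas are installed; the two-sentence corpus and variable names are only for illustration) of how tokenized text is turned into the numeric matrix that machine learning models need:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["God is Great! I won a lottery.",
        "I never win a lottery."]                 # illustrative toy corpus
vectorizer = CountVectorizer()                    # tokenizes the text and builds a vocabulary
matrix = vectorizer.fit_transform(docs)           # sparse document-term count matrix
# Wrap the counts in a DataFrame: one row per document, one column per word
# (on older scikit-learn versions this method is get_feature_names())
df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df)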

Tokenization of Sentences
The sub-module available for this is sent_tokenize. An obvious question in your mind would be why sentence tokenization is needed when we already have word tokenization. Imagine you need to count the average number of words per sentence; how would you calculate that? To accomplish such a task, you need both sentence tokenization and word tokenization to compute the ratio. Such output serves as an important feature for machine training, as the answer would be numeric.
Check the example below to learn how sentence tokenization differs from word tokenization.
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))

Output: ['God is Great!', 'I won a lottery.']


We get nine word tokens but only two sentence tokens for the same input.

Explanation of the program:


• As in the previous program, we imported the sent_tokenize module.
• We have taken the same text. The sent_tokenize module parsed the text and produced the output above. It is clear that this function breaks the text into individual sentences.
The above examples are good stepping stones for understanding the mechanics of word and sentence tokenization.
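To make the "average words per sentence" motivation above concrete, here is a minimal sketch, assuming the NLTK punkt tokenizer data has been downloaded; the sample text and variable names are illustrative only:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "God is Great! I won a lottery."
sentences = sent_tokenize(text)                            # split into sentences
word_counts = [len(word_tokenize(s)) for s in sentences]   # word tokens per sentence
avg = sum(word_counts) / len(sentences)                    # average words per sentence
print(sentences, word_counts, avg)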

POS Tagging
Parts-of-speech (POS) tagging is responsible for reading text in a language and assigning a specific token (part of speech) to each word.
e.g.
Input: Everything to permit us.
Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
Steps Involved:
• Tokenize the text (word_tokenize)
• Apply pos_tag to the tokens from the above step, i.e., nltk.pos_tag(tokenized_text)
Some examples are as below:
Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
JJR adjective, comparative (larger)
JJS adjective, superlative (largest)
LS list marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun, plural (desks)
NNP proper noun, singular (Sarah)
NNPS proper noun, plural (Indians or Americans)
PDT predeterminer (all, both, half)
POS possessive ending (parent's)
PRP personal pronoun (hers, herself, him, himself)
PRP$ possessive pronoun (her, his, mine, my, our)
RB adverb (occasionally, swiftly)
RBR adverb, comparative (greater)
RBS adverb, superlative (biggest)
RP particle (about)
TO infinitive marker (to)
UH interjection (goodbye)
VB verb, base form (ask)
VBG verb, gerund (judging)
VBD verb, past tense (pleaded)
VBN verb, past participle (reunified)
VBP verb, present tense, not 3rd person singular (wrap)
VBZ verb, present tense, 3rd person singular (bases)
WDT wh-determiner (that, what)
WP wh-pronoun (who)

WRB wh-adverb (how)

A POS tagger is used to assign grammatical information to each word of a sentence. Installing, importing and downloading the required NLTK packages completes the setup.
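Putting the two steps listed earlier together, here is a minimal sketch of POS tagging with NLTK. It assumes the punkt and averaged_perceptron_tagger resources have already been fetched with nltk.download(); the sample sentence is the one used above.
import nltk
from nltk.tokenize import word_tokenize

sentence = "Everything to permit us."
tokens = word_tokenize(sentence)     # step 1: tokenize the text
tags = nltk.pos_tag(tokens)          # step 2: assign a part-of-speech tag to each token
print(tags)
# e.g. [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]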

Chunking
Chunking is used to add more structure to the sentence following parts-of-speech (POS) tagging. It is also known as shallow parsing. The resulting groups of words are called "chunks." In shallow parsing, there is at most one level between the root and the leaves, while deep parsing comprises more than one level. Shallow parsing is also called light parsing or chunking.
The primary usage of chunking is to make groups of "noun phrases." The parts of speech are combined with regular expressions.
Rules for Chunking:
There are no pre-defined rules; you can combine tags according to your needs and requirements. For example, suppose you need to tag nouns, verbs (past tense), adjectives, and coordinating conjunctions from the sentence. You can use the rule below:
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
Following table shows what the various symbol means:
Name of symbol Description

. Any character except new line

* Match 0 or more repetitions

? Match 0 or 1 repetitions

Now let us write the code to understand the rule better.


from nltk import pos_tag
from nltk import RegexpParser
text ="learn php from guru99 and make study easy".split()
print("After Split:",text)
tokens_tag = pos_tag(text)
print("After Token:",tokens_tag)
patterns= """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:",chunker)
output = chunker.parse(tokens_tag)
print("After Chunking",output)
Output
After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study',
'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99',
'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
<ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
(mychunk learn/JJ)
(mychunk php/NN)
from/IN
(mychunk guru99/NN and/CC)
make/VB
(mychunk study/NN easy/JJ))
The conclusion from the above example: "make" is a verb that is not included in the rule, so it is not tagged as mychunk.
Use Case of Chunking
Chunking is used for entity detection. An entity is the part of the sentence from which the machine gets the value for an intent.
Example:
Temperature of New York.
Here, "Temperature" is the intent and "New York" is an entity.
In other words, chunking is used for selecting subsets of tokens. Please follow the code below to understand how chunking is used to select tokens. In this example, you will see a graph corresponding to a chunk of a noun phrase. We will write the code and draw the graph for better understanding.
Code to Demonstrate Use Case
import nltk
text = "learn php from guru99"
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp =nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()   # Draws the chunked pattern graphically (noun phrase chunking)
Output:
['learn', 'php', 'from', 'guru99'] -- These are the tokens
[('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN')] -- These
are the pos_tag
(S (NP learn/JJ php/NN) from/IN (NP guru99/NN)) -- Noun Phrase
Chunking
Noun Phrase chunking graph
From the graph, we can conclude that "learn" and "guru99" are two different tokens but both are categorized as noun phrases, whereas the token "from" does not belong to a noun phrase.
Chunking is used to categorize different tokens into the same chunk. The result depends on the grammar that has been selected. Chunking is further used to tag patterns and to explore text corpora.

Stemming and Lemmatization with Python


NLTK
What is Stemming?
Stemming is a kind of normalization for words. Normalization is a technique where a set of words in a sentence is converted into a shortened sequence to simplify lookup. Words which have the same meaning but vary according to the context or sentence are normalized.
In other words, there is one root word, but there are many variations of that same word. For example, the root word is "eat" and its variations are "eats, eating, eaten" and so on. In the same way, with the help of stemming, we can find the root word of any variation.
For example
He was riding.
He was taking the ride.
In the above two sentences, the meaning is the same, i.e., a riding activity in the past. A human can easily understand that both meanings are the same, but for machines both sentences are different, so it becomes hard to convert them into the same data row. If we do not provide the same dataset, the machine fails to predict. It is therefore necessary to normalize the meaning of each word to prepare the dataset for machine learning, and here stemming is used to categorize the same type of data by finding its root word.
Let's implement this with a Python program. NLTK has an algorithm named "PorterStemmer". This algorithm accepts a list of tokenized words and stems each one to its root word.
Program for understanding Stemming
from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)
Output:
wait
wait
wait
wait

Code Explanation:
• There is a stem module in NLTK which is imported. If you import the complete module, the program becomes heavy as it contains thousands of lines of code, so from the entire stem module we only imported "PorterStemmer."
• We prepared a dummy list of variations of the same word.
• An object is created which belongs to the class nltk.stem.porter.PorterStemmer.
• Further, we passed each word to the PorterStemmer one by one using a "for" loop. Finally, we got the output root word of each word in the list.
From the above explanation, it can also be concluded that stemming is considered an important preprocessing step because it removes redundancy in the data and variations of the same word. As a result, the data is filtered, which helps in better machine training.
Now we pass a complete sentence and check for its behavior as an output.
Program:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

sentence = "Hello Guru99, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)
ps = PorterStemmer()
for w in words:
    rootWord = ps.stem(w)
    print(rootWord)
Output:
hello
guru99
,
you
have
to
build
a
veri
good
site
and
i
love
visit
your
site
.

Code Explanation
• The PorterStemmer package is imported from the stem module.
• Packages for tokenization of sentences as well as words are imported.
• A sentence is written which is to be tokenized in the next step.
• Word tokenization is implemented in this step.
• An object for PorterStemmer is created here.
• The loop is run and stemming of each word is done using the object created above.
Conclusion:
Stemming is a data-preprocessing step. The English language has many variations of a single word. These variations create ambiguity in machine learning training and prediction. To create a successful model, it is vital to filter such words and convert them to the same type of sequenced data using stemming. This is also an important technique for deriving row data from a set of sentences and removing redundant data, also known as normalization.

What is Lemmatization?
Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It helps in returning the base or dictionary form of a word, which is known as the lemma. The NLTK lemmatization method is based on WordNet's built-in morphy function. Text preprocessing includes both stemming and lemmatization. Many people find the two terms confusing; some treat them as the same, but there is a difference between them. Lemmatization is usually preferred over stemming for the reasons below.

Why is Lemmatization better than Stemming?


A stemming algorithm works by cutting the suffix from the word; in a broader sense, it cuts either the beginning or the end of the word.
On the contrary, lemmatization is a more powerful operation that takes into consideration the morphological analysis of the words. It returns the lemma, which is the base form of all of a word's inflectional forms. In-depth linguistic knowledge is required to create the dictionaries and look up the proper form of the word. Stemming is a general operation, while lemmatization is an intelligent operation where the proper form is looked up in a dictionary. Hence, lemmatization helps in forming better machine learning features.

Code to distinguish between Lemmatization and Stemming


Stemming code
import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))
Output:
Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri
Lemmatization code
import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
Output:
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry
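Note that "studying" came back unchanged above because lemmatize() treats every word as a noun by default. Here is a minimal sketch, assuming the WordNet data has been downloaded, of passing the part of speech explicitly:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat the word as a verb
print(lemmatizer.lemmatize("studying", pos="v"))   # study
print(lemmatizer.lemmatize("studying"))            # default pos="n" leaves it as studying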

WordNet with NLTK: Finding Synonyms for Words in Python
What is Wordnet?
WordNet is an NLTK corpus reader, a lexical database for English. It can be used to find the meanings of words, synonyms or antonyms. One can define it as a semantically oriented dictionary of English. It is imported with the following command:
from nltk.corpus import wordnet as guru
Stats reveal that there are 155,287 words and 117,659 synonym sets included in English WordNet.
Different methods available with WordNet can be found by typing dir(guru)
['_LazyCorpusLoader__args', '_LazyCorpusLoader__kwargs', '_LazyCorpusLoader__load',
'_LazyCorpusLoader__name', '_LazyCorpusLoader__reader_cls', '__class__', '__delattr__',
'__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__',
'__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
'__subclasshook__', '__unicode__', '__weakref__', '_unload', 'subdir', 'unicode_repr']
Let us understand some of the features available with WordNet.
Synset: Also called a synonym set, a collection of synonymous words. Let us check an example:
from nltk.corpus import wordnet
syns = wordnet.synsets("dog")
print(syns)
Output:
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'),
Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'),
Synset('andiron.n.01'), Synset('chase.v.01')]
Lexical Relations: These are semantic relations which are reciprocated: if there is a relationship between {x1,x2,...xn} and {y1,y2,...yn}, then there is also a relation between {y1,y2,...yn} and {x1,x2,...xn}. For example, synonymy is the opposite of antonymy, and hypernyms and hyponyms are types of lexical concepts.
Let us write a Python program to find the synonyms and antonyms of the word "active" using WordNet.
from nltk.corpus import wordnet
synonyms = []
antonyms = []

for syn in wordnet.synsets("active"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))
The output of the code:
{'dynamic', 'fighting', 'combat-ready', 'active_voice', 'active_agent', 'participating', 'alive', 'active'}
-- Synonym
{'stative', 'passive', 'quiet', 'passive_voice', 'extinct', 'dormant', 'inactive'} -- Antonym

Explanation of the code


• WordNet is a corpus, so it is imported from nltk.corpus.
• The synonym and antonym lists are initialized as empty lists, to be appended to.
• Synonyms of the word "active" are looked up via its synsets and appended to the synonyms list; the same process is repeated for the antonyms.
• The output is printed.
Conclusion:
WordNet is a lexical database that has been used by major search engines. From WordNet, information about a given word or phrase can be obtained, such as the following (a short sketch of querying these relations appears after the list):
• synonyms (words having the same meaning)
• hypernyms (the generic term used to designate a class of specifics, e.g., meal is a hypernym of breakfast) and hyponyms (specific instances of a class, e.g., rice is a hyponym of meal)
• holonyms (the whole that something is part of, e.g., a meal is a holonym of its proteins and carbohydrates)
• meronyms (the parts of a whole, e.g., a meal is a meronym of daily food intake)
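As a rough illustration of these relations, the following sketch, assuming the WordNet corpus has been downloaded via nltk.download('wordnet'), queries a few of them for the synset 'dog.n.01'; the chosen synset is just an example.
from nltk.corpus import wordnet

dog = wordnet.synset("dog.n.01")
print(dog.hypernyms())        # more general concepts, e.g. canine, domestic animal
print(dog.hyponyms()[:3])     # more specific kinds of dog
print(dog.member_holonyms())  # groups a dog is a member of, e.g. a pack
print(dog.part_meronyms())    # parts of a dog listed in WordNet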
WordNet also provides information on coordinate terms, derivations, senses and more. It is used to find similarities between any two words, and it also holds information on the relations of related words. In short, one can treat it as a dictionary or thesaurus. Going deeper into WordNet, it is divided into four sub-nets:
• Noun
• Verb
• Adjective
• Adverb
It can be used in the area of artificial intelligence for text analysis. With the help of WordNet, you can create a corpus for spell checking, language translation, spam detection and much more. In the same way, you can use this corpus and mold it for your own dynamic functionality; it is like a ready-made corpus that you can use in your own way.

Counting POS Tags, Frequency Distribution & Collocations in NLTK
COUNTING POS TAGS
We have discussed various pos_tags in the previous section. In this particular tutorial, you will study how to count these tags. Counting tags is crucial for text classification as well as for preparing features for natural-language-based operations. I will be discussing the approach which guru99 followed while preparing the code, along with a discussion of the output. Hope this will help you.
How to count tags:
Here we will first write working code and then explain the code step by step.
from collections import Counter
import nltk

text = "Guru99 is one of the best sites to learn WEB, SAP, Ethical Hacking and much more online."
lower_case = text.lower()
tokens = nltk.word_tokenize(lower_case)
tags = nltk.pos_tag(tokens)
counts = Counter(tag for word, tag in tags)
print(counts)
Output:
Counter({'NN': 5, ',': 2, 'TO': 1, 'CC': 1, 'VBZ': 1, 'NNS': 1, 'CD': 1, '.': 1, 'DT': 1, 'JJS': 1, 'JJ': 1,
'JJR': 1, 'IN': 1, 'VB': 1, 'RB': 1})
Elaboration of the code

• To count the tags, you can use the Counter class from the collections module. A Counter is a dictionary subclass which works on the principle of key-value pairs: it is an unordered collection where elements are stored as dictionary keys while their counts are the values.
• Import nltk, which contains modules to tokenize the text.
• Write the text whose pos_tags you want to count.
• Some words are in upper case and some in lower case, so it is appropriate to transform all the words to lower case before applying tokenization.
• Pass the words through word_tokenize from nltk.
• Calculate the pos_tag of each token:
Output = [('guru99', 'NN'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'),
('the', 'DT'), ('best', 'JJS'), ('site', 'NN'), ('to', 'TO'), ('learn',
'VB'), ('web', 'NN'), (',', ','), ('sap', 'NN'), (',', ','),
('ethical', 'JJ'), ('hacking', 'NN'), ('and', 'CC'), ('much', 'RB'),
('more', 'JJR'), ('online', 'JJ')]
• Now comes the role of the Counter, imported in code line 1. Each tag becomes a key and the number of times it appears in the text becomes its value, so the Counter gives the total count of each tag in the text.
Frequency Distribution
Frequency distribution refers to the number of times an outcome of an experiment occurs. It is used to find the frequency of each word occurring in a document. It uses the FreqDist class, defined in the nltk.probability module.
A frequency distribution is usually created by counting the samples of repeatedly running an experiment; the count is incremented by one each time a sample is observed. E.g.
freq_dist = FreqDist()
for token in document:
    freq_dist[token] += 1
For any word, we can check how many times it occurred in a particular document. E.g.
• Count: freq_dist['and'] returns the number of times 'and' occurred.
• Frequency Method: freq_dist.freq('and') returns the relative frequency of the given
sample.
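As a quick illustration of these lookups, the sketch below, assuming the punkt tokenizer data is available, builds a FreqDist from a short sentence and queries a token's count and relative frequency; the sentence itself is only an example.
import nltk
from nltk import FreqDist

tokens = nltk.word_tokenize("to be or not to be")
fd = FreqDist(tokens)            # counts each token
print(fd['to'])                  # absolute count of 'to' -> 2
print(fd.freq('to'))             # relative frequency -> 2/6
print(fd.most_common(2))         # the two most frequent tokens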
We will write a small program and will explain its working in detail. We will write some text
and will calculate the frequency distribution of each word in the text.
import nltk

a = "Guru99 is the site where you can find the best tutorials for Software Testing Tutorial, SAP Course for Beginners. Java Tutorial for Beginners and much more. Please visit the site guru99.com and much more."
words = nltk.tokenize.word_tokenize(a)
fd = nltk.FreqDist(words)
fd.plot()

Explanation of code:
• Import the nltk module.
• Write the text whose word distribution you need to find.
• Tokenize each word in the text; the tokens serve as input to the FreqDist module of nltk.
• Apply the list of words to nltk.FreqDist.
• Plot the words in the graph using plot().
Please visualize the graph for a better understanding of the text written.

Frequency distribution of each word in the graph


NOTE: You need to have matplotlib installed to see the above graph
Observe the graph above. It corresponds to counting the occurrence of each word in the text. It helps in the study of text and further in implementing text-based sentiment analysis. In a nutshell, it can be concluded that nltk has a module for counting the occurrence of each word in the text, which helps in preparing the statistics of natural language features. It plays a significant role in finding the keywords in the text. You can also extract the text from a PDF using libraries such as PyPDF2 and feed the text to nltk.FreqDist.
The key term is "tokenize." After tokenizing, the module checks each word in a given paragraph or text document to determine the number of times it occurred. You do not need the NLTK toolkit for this; you could also do it with your own Python programming skills. The NLTK toolkit simply provides ready-to-use code for these operations.
Counting each word on its own may not be very useful. Instead, one should focus on collocations and bigrams, which deal with words in pairs. These pairs identify useful keywords for better natural language features which can be fed to the machine. Please see below for the details.

Collocations: Bigrams and Trigrams


What are Collocations?
Collocations are pairs of words that occur together many times in a document. They are calculated as the number of those pairs occurring together relative to the overall word count of the document.
Consider the electromagnetic spectrum, with phrases like ultraviolet rays and infrared rays.
The words ultraviolet and rays are not used individually, and hence can be treated as a collocation. Another example is a CT scan. We don't say CT and scan separately, and hence they are also treated as a collocation.
We can say that finding collocations requires calculating the frequencies of words and their appearance in the context of other words. These specific collections of words require filtering to retain useful content terms. Each n-gram of words may then be scored according to some association measure, to determine the relative likelihood of each n-gram being a collocation.
Collocations can be categorized into two types:
• Bigrams: combinations of two words
• Trigrams: combinations of three words
Bigrams and trigrams provide more meaningful and useful features for the feature extraction stage. They are especially useful in text-based sentiment analysis.
Bigrams Example Code
import nltk

text = "Guru99 is a totally new kind of learning experience."


Tokens = nltk.word_tokenize(text)
output = list(nltk.bigrams(Tokens))
print(output)
Output
[('Guru99', 'is'), ('is', 'a'), ('a', 'totally'), ('totally', 'new'),
('new', 'kind'), ('kind', 'of'), ('of', 'learning'),
('learning', 'experience'), ('experience', '.')]

Trigrams Example Code


Sometimes it becomes important to look at triples of words in a sentence for statistical analysis and frequency counts. This again plays a crucial role in forming NLP (natural language processing) features as well as text-based sentiment prediction.
The same code is run for calculating the trigrams.
import nltk
text = "Guru99 is a totally new kind of learning experience."
Tokens = nltk.word_tokenize(text)
output = list(nltk.trigrams(Tokens))
print(output)
Output
[('Guru99', 'is', 'a'), ('is', 'a', 'totally'), ('a', 'totally', 'new'),
('totally', 'new', 'kind'), ('new', 'kind', 'of'), ('kind', 'of', 'learning'),
('of', 'learning', 'experience'), ('learning', 'experience', '.')]
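As a hedged extension of the association-measure idea mentioned earlier, the sketch below uses NLTK's collocation finder to rank bigrams by pointwise mutual information (PMI); the toy text and variable names are illustrative only, and a real corpus would give more meaningful rankings.
import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

text = "ultraviolet rays and infrared rays are parts of the electromagnetic spectrum"
tokens = nltk.word_tokenize(text)

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)   # counts all adjacent word pairs
print(finder.nbest(bigram_measures.pmi, 3))           # top 3 bigrams ranked by PMI score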
