How To Install Jupyter Notebook On Ubuntu: Getting Started
Jupyter Notebook is one of the most widely used tools for executing Python interactively, directly
from a browser. With Jupyter Notebooks, we can mix code with interactive exercises and
documentation, which frees us from keeping comments behind the # symbol and lets us see the
output of small snippets of code directly in the browser.
With the IPython 4.0 release, the language-independent parts of the project (the notebook format,
message protocol, qtconsole, notebook web application, etc.) moved to a new project under
the name Jupyter.
Getting Started
We will start by installing the most basic components for this lesson, which are Python and PIP.
Let's start with basic updates on our machine:
sudo apt-get update
Here is what we get back with this command:
Update machine
Next, we can install required components in a single command:
sudo apt-get -y install python2.7 python-pip python-dev
This installation might take some time depending on network speed, as many dependencies are
installed here. We are using Python 2.7 with the PIP package manager, with which we can install
many other Python modules as we go. Finally, many of Jupyter's dependencies rely on Python C
extensions, so we installed the python-dev package as well.
To verify that everything went well, let us check the Python & PIP version with these
commands:
python --version
pip --version
We will get back:
Jupyter Notebook
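Jupyter itself is not among the packages installed above. A minimal sketch of installing and
launching it with the pip we just set up (assuming the same Ubuntu machine) is:
sudo pip install jupyter
jupyter notebook
By default the notebook server then opens in the browser at http://localhost:8888.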
We can now create a new notebook:
Creating new notebook
We can provide a name to this notebook by clicking on title bar:
Naming a Notebook
Finally, you can write sample Python code and execute it in the browser itself.
Here is a short snippet that should generally work:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy
In software, it's said that all abstractions are leaky, and this is true for the Jupyter notebook as it
is for any other software. I most often see this manifest itself with the following issue:
I installed package X and now I can't import it in the notebook. Help!
This issue is a perennial source of StackOverflow questions.
Fundamentally the problem is usually rooted in the fact that the Jupyter kernels are
disconnected from Jupyter's shell; in other words, the installer points to a different Python
version than is being used in the notebook. In the simplest contexts this issue does not arise, but
when it does, debugging the problem requires knowledge of the intricacies of the operating
system, the intricacies of Python package installation, and the intricacies of Jupyter itself. In
other words, the Jupyter notebook, like all abstractions, is leaky.
In the wake of several discussions on this topic with colleagues, some online and some off,
I decided to treat this issue in depth here. This post will address a couple of things:
• First, I'll provide a quick, bare-bones answer to the general question, how can I install a
Python package so it works with my Jupyter notebook, using pip and/or conda?
• Second, I'll dive into some of the background of exactly what the Jupyter notebook
abstraction is doing, how it interacts with the complexities of the operating system, and
how you can think about where the "leaks" are, and thus better understand what's
happening when things stop working.
• Third, I'll talk about some ideas the community might consider to help smooth over
these issues, including some changes that the Jupyter, Pip, and Conda developers might
consider to ease the cognitive load on users.
This post will focus on two approaches to installing Python packages: pip and conda. Other
package managers exist (including platform-specific tools like yum, apt, homebrew, etc., as well
as cross-platform tools like enstaller), but I'm less familiar with them and won't be remarking on
them further.
When in doubt, prefer python -m pip install over a bare pip install, because the former is more
explicit about where the package will be installed (more on this below).
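The root of most confusion is which Python the notebook is actually running. A quick way to
check is to inspect sys.executable and sys.path from inside a Python session (the paths that
follow are from the original author's machine):
import sys
print(sys.executable)
print(sys.path)
For the system Python, this reports: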
/usr/bin/python
['',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload',
 '/Library/Python/2.7/site-packages',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
The Anaconda Python used to launch this notebook, by contrast, reports:
/Users/jakevdp/anaconda/bin/python
['',
 '/Users/jakevdp/anaconda/lib/python36.zip',
 '/Users/jakevdp/anaconda/lib/python3.6',
 '/Users/jakevdp/anaconda/lib/python3.6/lib-dynload',
 '/Users/jakevdp/anaconda/lib/python3.6/site-packages',
 '/Users/jakevdp/anaconda/lib/python3.6/site-packages/schemapi-0.3.0.dev0+791c7f6-py3.6.egg',
 '/Users/jakevdp/anaconda/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg']
The full details here are not particularly important, but it is important to emphasize that each
Python executable has its own distinct paths, and unless you modify sys.path (which should
only be done with great care) you cannot import packages installed in a different Python
environment.
When you run pip install or conda install, these commands are associated with a
particular Python version:
• pip installs packages into the Python installation that shares its own path
• conda installs packages in the current active conda environment
So, for example we see that pip install will install to the conda environment named
python3.6:
!type pip
pip is /Users/jakevdp/anaconda/envs/python3.6/bin/pip
And conda install will do the same, because python3.6 is the current active environment
(notice the * indicating the active environment):
!conda env list
# conda environments:
#
python2.7 /Users/jakevdp/anaconda/envs/python2.7
python3.5 /Users/jakevdp/anaconda/envs/python3.5
python3.6 * /Users/jakevdp/anaconda/envs/python3.6
rstats /Users/jakevdp/anaconda/envs/rstats
root /Users/jakevdp/anaconda
The reason both pip and conda default to the conda python3.6 environment is that this is the
Python environment I used to launch the notebook.
I'll say this again for emphasis: the shell environment in Jupyter notebook matches the
Python version used to launch the notebook.
How Jupyter executes code: Jupyter Kernels
The next relevant question is how Jupyter chooses to execute Python code, and this brings us to
the concept of a Jupyter Kernel.
A Jupyter kernel is a set of files that point Jupyter to some means of executing code within the
notebook. For Python kernels, this will point to a particular Python version, but Jupyter is
designed to be much more general than this: Jupyter has dozens of available kernels for
languages including Python 2, Python 3, Julia, R, Ruby, Haskell, and even C++ and Fortran!
If you're using the Jupyter notebook, you can change your kernel at any time using the Kernel →
Change Kernel menu item.
To see the kernels you have available on your system, you can run the following command in the
shell:
!jupyter kernelspec list
Available kernels:
  python3      /Users/jakevdp/anaconda/envs/python3.6/lib/python3.6/site-packages/ipykernel/resources
  conda-root   /Users/jakevdp/Library/Jupyter/kernels/conda-root
  python2.7    /Users/jakevdp/Library/Jupyter/kernels/python2.7
  python3.5    /Users/jakevdp/Library/Jupyter/kernels/python3.5
  python3.6    /Users/jakevdp/Library/Jupyter/kernels/python3.6
Each of these listed kernels is a directory that contains a file called kernel.json which
specifies, among other things, which language and executable the kernel should use. For
example:
!cat /Users/jakevdp/Library/Jupyter/kernels/conda-root/kernel.json
{
  "argv": [
    "/Users/jakevdp/anaconda/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "python (conda-root)",
  "language": "python"
}
If you'd like to create a new kernel, you can do so using ipykernel's install command; for
example, I created the above kernels for my primary conda environments using the following as
a template:
$ source activate myenv
$ python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
To make such a kernel activate its conda environment automatically, you can point it at a small
wrapper script (here called kernel-startup.sh, a name chosen only for illustration) that activates
the environment and then launches ipykernel:
#!/usr/bin/env bash
source activate myenv
# this is the critical part, and should be at the end of your script:
exec python -m ipykernel $@
Then in your kernel.json file, modify the argv field to look like this:
"argv": [
"/path/to/kernel-startup.sh",
"-f",
"{connection_file}"
]
Once you do this, switching to the myenv kernel will automatically activate the myenv conda
environment, which changes your $CONDA_PREFIX, $PATH and other system variables such that
!conda install XXX and !pip install XXX will work correctly. A similar approach could
work for virtualenvs or other Python environments.
There is one tricky issue here: this approach will fail if your myenv environment does not have
the ipykernel package installed, and probably also requires it to have a jupyter version
compatible with that used to launch the notebook. So it's not a full solution to the problem by any
means, but if Python kernels could be designed to do this sort of shell initialization by default, it
would be far less confusing to users: !pip install and !conda install would simply work.
Potential Changes to pip
One source of installation confusion, even outside of Jupyter, is the fact that, depending on the
nature of your system's aliases and $PATH variable, pip and python might point to different
paths. In this case pip install will install packages to a path inaccessible to the python
executable. For this reason, it is safer to use python -m pip install, which explicitly specifies
the desired Python version (explicit is better than implicit, after all).
This is one reason that pip install no longer appears in Python's docs, and experienced Python
educators like David Beazley never teach bare pip. CPython developer Nick Coghlan has even
indicated that the pip executable may someday be deprecated in favor of python -m pip. Even
though it's more verbose, I think forcing users to be explicit would be a useful change,
particularly as the use of virtualenvs and conda envs becomes more common.
Changes to Conda
I can think of a couple of modifications to conda's API that may be helpful to users.
Explicit invocation
For symmetry with pip, it would be nice if python -m conda install could be expected to
work in the same way the pip counterpart does. You can call conda this way in the root
environment, but the conda Python package (as opposed to the conda executable) cannot
currently be installed anywhere but the root environment:
(myenv) jakevdp$ conda install conda
Fetching package metadata ...........
InstallError: Error: 'conda' can only be installed into the root environment
I suspect that allowing python -m conda install in all conda environments would require a
fairly significant redesign of conda's installation model, so it may not be worth the change just
for symmetry with pip's API. That said, such a symmetry would certainly be a help to users.
A pip channel for conda?
Another useful change conda could make would be to add a channel that essentially mirrors the
Python Package Index, so that when you do conda install some-package it will
automatically draw from packages available to pip as well.
I don't have a deep enough knowledge of conda's architecture to know how easy such a feature
would be to implement, but I do have loads of experience helping newcomers to Python and/or
conda: I can say with certainty that such a feature would go a long way toward softening their
learning curve.
New Jupyter Magic Functions
Even if the above changes to the stack are not possible or desirable, we could simplify the user
experience somewhat by introducing %pip and %conda magic functions within the Jupyter
notebook that detect the current kernel and make certain that packages are installed in the correct
location.
pip magic
For example, here's how you can define a %pip magic function that works in the current kernel:
from IPython.core.magic import register_line_magic

@register_line_magic
def pip(args):
    """Use pip from the current kernel"""
    from pip import main
    main(args.split())
Running it as follows will install packages in the expected location:
%pip install numpy
Requirement already satisfied: numpy in
/Users/jakevdp/anaconda/lib/python3.6/site-packages
Note that Jupyter developer Matthias Bussonnier has published essentially this in his pip_magic
repository, so you can do
$ python -m pip install pip_magic
and use this right now (that is, assuming you install pip_magic in the right place!)
conda magic
Similarly, we can define a conda magic that will do the right thing if you type %conda install
XXX. This is a bit more involved than the pip magic, because it must first confirm that the
environment is conda-compatible, and then (related to the lack of python -m conda install)
must call a subprocess to execute the appropriate shell command:
from IPython.core.magic import register_line_magic
import sys
import os
from subprocess import Popen, PIPE

def is_conda_environment():
    """Return True if the current Python executable is in a conda env"""
    # TODO: make this work with Conda.exe in Windows
    conda_exec = os.path.join(os.path.dirname(sys.executable), 'conda')
    conda_history = os.path.join(sys.prefix, 'conda-meta', 'history')
    return os.path.exists(conda_exec) and os.path.exists(conda_history)

@register_line_magic
def conda(args):
    """Use conda from the current kernel"""
    # TODO: make this work with Conda.exe in Windows
    # TODO: fix string encoding to work with Python 2
    if not is_conda_environment():
        raise ValueError("The python kernel does not appear to be a conda environment. "
                         "Please use ``%pip install`` instead.")
    # Build the full command: the conda executable next to the current Python,
    # followed by the arguments typed after %conda
    conda_exec = os.path.join(os.path.dirname(sys.executable), 'conda')
    args = [conda_exec] + args.split()
    # Because the notebook does not allow us to respond "yes" during the
    # installation, we need to insert --yes in the argument list for some commands
    if args[1] in ['install', 'update', 'upgrade', 'remove', 'uninstall', 'create']:
        if '-y' not in args and '--yes' not in args:
            args.insert(2, '--yes')
    # Call conda from the command line with subprocess & send results to stdout & stderr
    with Popen(args, stdout=PIPE, stderr=PIPE) as process:
        # Read stdout character by character, as it includes real-time progress updates
        for c in iter(lambda: process.stdout.read(1), b''):
            sys.stdout.write(c.decode(sys.stdout.encoding))
        # Read stderr line by line, because real-time output does not matter
        for line in iter(process.stderr.readline, b''):
            sys.stderr.write(line.decode(sys.stderr.encoding))
You can now use %conda install and it will install packages to the correct environment:
%conda install numpy
Fetching package metadata ...........
Solving package specifications: .
Python Modules
In this article, you will learn to create and import custom modules in Python. Also, you will find
different techniques to import and use custom and built-in modules in Python.
Table of Contents
• What are modules in Python?
• How to import modules in Python?
• Python import statement
• Import with renaming
• Python from...import statement
• Import all names
• Python Module Search Path
• Reloading a module
• The dir() built-in function
# example.py
def add(a, b):
    """This function adds two numbers and returns the result."""
    result = a + b
    return result
Here, we have defined a function add() inside a module named example. The function takes in
two numbers and returns their sum.
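A minimal example of renaming a module on import, matching the explanation below:
import math as m
print("The value of pi is", m.pi)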
We have renamed the math module as m. This can save us typing time in some cases.
Note that the name math is not recognized in our scope; hence, math.pi is invalid, and m.pi is the
correct reference.
Python from...import statement
We can import specific names from a module without importing the module as a whole. Here is
an example.
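A minimal example of this form:
from math import pi
print("The value of pi is", pi)   # pi is used directly, without the math. prefix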
Import all names
We can also import all the definitions from a module at once, as shown in the example below.
This makes all names, except those beginning with an underscore, visible in our scope.
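A minimal sketch of this form:
from math import *
print(pi)          # 3.141592653589793
print(sqrt(16.0))  # 4.0 - but it is no longer obvious these names came from math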
Importing everything with the asterisk (*) symbol is not a good programming practice. This can
lead to duplicate definitions for an identifier. It also hampers the readability of our code.
Reloading a module
The Python interpreter imports a module only once during a session. This makes things more
efficient. Here is an example to show how this works.
Suppose we have the following code in a module named my_module.
# This module shows the effect of
# multiple imports and reload
print("This code got executed")
Now we see the effect of multiple imports.
>>> import my_module
This code got executed
>>> import my_module
>>> import my_module
We can see that our code got executed only once. This goes to say that our module was imported
only once.
Now, if our module changed during the course of the program, we would have to reload it. One
way to do this is to restart the interpreter, but this does not help much.
Python provides a neater way of doing this. We can use the reload() function inside the imp
module to reload a module. This is how it's done.
>>> import imp
>>> import my_module
This code got executed
>>> import my_module
>>> imp.reload(my_module)
This code got executed
<module 'my_module' from '.\\my_module.py'>
Python Package
In this article, you'll learn to divide your code base into clean, efficient modules using Python
packages. Also, you'll learn to import and use your own or third-party packages in your Python
program.
Table of Contents
• What are packages?
• Importing module from a package
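As a minimal sketch of the ideas listed above (the package and module names here are
hypothetical), a package is simply a directory containing an __init__.py file plus one or more
modules:
mypackage/
    __init__.py
    utils.py        # defines a function helper()
A module inside the package is then imported with a dotted name:
import mypackage.utils
mypackage.utils.helper()

# or, equivalently
from mypackage.utils import helper
helper()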
In this class, we'll cover what a Python namespace is and why it is needed. We'll also talk about
what scope is in Python and how namespaces can be used to implement it.
The concept of namespaces is not limited to any particular programming language. C/C++ and
Java also have it where it works as a means to distinguish between different sections of a
program.
The body of a section may consist of a method, or a function, or all the methods of a class. So, a
namespace is a practical approach to define the scope, and it helps to avoid name conflicts.
In Python, the namespace is a fundamental idea for structuring and organizing code, and it is
especially useful in large projects. However, it can be a somewhat difficult concept to grasp if
you're new to programming. Hence, we have tried to make namespaces just a little easier to
understand.
What are names in Python?
A name is simply a label that refers to an object; even a function can be bound to a second name
and called through it:
def function():
    print("Called through the name foo")

foo = function
foo()
You can also assign a name and then reuse it. Check the below example; it is alright for a name
to point to different values.
test = -1
print("type <test> :=", type(test))
test = "Pointing to a string now"
print("type <test> :=", type(test))
test = [0, 1, 1, 2, 3, 5, 8]
print("type <test> :=", type(test))
And here is the output.
type <test> := <class 'int'>
type <test> := <class 'str'>
type <test> := <class 'list'>
So, you can see that one name is working perfectly fine to hold data of different types.
You can learn more about types in Python in the Python data types documentation.
The naming mechanism works in line with Python's object system, i.e., everything in Python is
an object. All the data types such as numbers, strings, functions, and classes are objects, and a
name acts as a reference to reach those objects.
What are namespaces in Python?
A namespace is a simple system to control the names in a program. It ensures that names are
unique and won’t lead to any conflict.
Python implements namespaces in the form of dictionaries: it maintains a name-to-object
mapping where names act as keys and the objects as values. Multiple namespaces may contain
the same name, each pointing to a different object. Check out a few examples of namespaces for
more clarity.
Local Namespace
This namespace covers the local names inside a function. Python creates this namespace for
every function called in a program. It remains active until the function returns.
Global Namespace
This namespace covers the names from various imported modules used in a project. Python
creates this namespace for every module included in your program. It’ll last until the program
ends.
Built-in Namespace
This namespace covers the built-in functions and built-in exception names. Python creates it as
the interpreter starts and keeps it until you exit.
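A small illustrative sketch of the namespace-as-dictionary idea:
def sample():
    local_name = 1
    # the local namespace of this function, shown as a dictionary
    print(locals())

sample()                 # prints {'local_name': 1}
print(type(globals()))   # <class 'dict'> - the module-level namespace is a dict as well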
What is Scope in Python?
Namespaces make our programs immune from name conflicts. However, it doesn’t give us a free
ride to use a variable name anywhere we want. Python restricts names to be bound by specific
rules known as a scope. The scope determines the parts of the program where you could use that
name without any prefix.
Python outlines different scopes for locals, enclosing functions, modules, and built-ins. Check
them out in the list below.
• A local scope, also known as the innermost scope, holds the list of all local names available in
the current function.
• A scope for all the enclosing functions, it finds a name from the nearest enclosing scope and
goes outwards.
• A module level scope, it takes care of all the global names from the current module.
• The outermost scope which manages the list of all the built-in names. It is the last place to
search for a name that you cited in the program.
Scope Resolution in Python – Examples
Scope resolution for a given name begins from the inner-most function and then goes higher and
higher until the program finds the related object. If the search ends without any outcome, then
the program throws a NameError exception.
Let’s now see some examples which you can run inside any Python IDE or with IDLE.
a_var = 10
print("begin()-> ", dir())

def foo():
    b_var = 11
    print("inside foo()-> ", dir())

foo()
Inside foo(), dir() lists only the local name "b_var". A second example nests one function inside
another, so that dir() lists the names visible at each level; in outline it looks like this:
def outer_foo():
    outer_var = 3
    def inner_foo():
        inner_var = 5
        print(dir(), '- names in inner_foo')
    inner_foo()
    print(dir(), '- names in outer_foo')

outer_foo()
The output is as follows.
['inner_var'] - names in inner_foo
['inner_foo', 'outer_var'] - names in outer_foo
The second example defines a variable and a nested function inside the scope of outer_foo().
Inside inner_foo(), the dir() function displays only one name, "inner_var", which is expected, as
"inner_var" is the only variable defined there.
If you reuse a global name inside a local namespace, then Python creates a new local variable
with the same name.
a_var = 5
b_var = 7
def outer_foo():
global a_var
a_var = 3
b_var = 9
def inner_foo():
global a_var
a_var = 4
b_var = 8
print('a_var inside inner_foo :', a_var)
print('b_var inside inner_foo :', b_var)
inner_foo()
print('a_var inside outer_foo :', a_var)
print('b_var inside outer_foo :', b_var)
outer_foo()
print('a_var outside all functions :', a_var)
print('b_var outside all functions :', b_var)
Here goes the output of the above code after execution.
a_var inside inner_foo : 4
b_var inside inner_foo : 8
a_var inside outer_foo : 4
b_var inside outer_foo : 9
a_var outside all functions : 4
b_var outside all functions : 7
We’ve declared a global variable as “a_var” inside both the outer_foo() and inner_foo()
functions. However, we’ve assigned different values in the same global variable. And that’s the
reason the value of “a_var” is same (i.e., 4) on all occasions.
Whereas, each function is creating its own “b_var” variable inside the local scope. And the
print() function is showing the values of this variable as per its local context.
How to correctly import modules in Python?
It is very likely that you would import some of the external modules in your program. So, we’ll
discuss here some of the import strategies, and you can choose the best one.
Import all names from a module
from <module name> import *
It’ll import all the names from a module directly into your working namespace. Since it is an
effortless way, so you might tempt to use this method. However, you may not be able to tell
which module imported a particular function.
Import the module by its name
A safer strategy is a plain import <module name>, which adds only the module's name to your
namespace; every definition stays behind that prefix. Here is an example of using this method.
print("namespace_1: ", dir())
import math
print("namespace_2: ", dir())
print(math.sqrt(144.2))
The output of the above example goes like this.
namespace_1: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__']
namespace_2: ['__builtins__', '__doc__', '__file__', '__loader__',
'__name__', '__package__', '__spec__', 'math']
12.00833044182246
Math in Python
math.ceil(x)
Return the ceiling of x as a float, the smallest integer value greater than or equal to x.
math.copysign(x, y)
Return x with the sign of y. On a platform that supports signed zeros, copysign(1.0, -
0.0) returns -1.0.
New in version 2.6.
math.fabs(x)
Return the absolute value of x.
math.floor(x)
Return the floor of x as a float, the largest integer value less than or equal to x.
math.fmod(x, y)
Return fmod(x, y), as defined by the platform C library. Note that the Python
expression x % y may not return the same result. The intent of the C standard is that
fmod(x, y) be exactly (mathematically; to infinite precision) equal to x - n*y for some
integer n such that the result has the same sign as x and magnitude less than abs(y).
Python’s x % y returns a result with the sign of y instead, and may not be exactly
computable for float arguments. For example, fmod(-1e-100, 1e100) is -1e-100, but
the result of Python’s -1e-100 % 1e100 is 1e100-1e-100, which cannot be represented
exactly as a float, and rounds to the surprising 1e100. For this reason, function fmod() is
generally preferred when working with floats, while Python’s x % y is preferred when
working with integers.
math.frexp(x)
Return the mantissa and exponent of x as the pair (m, e). m is a float and e is an integer
such that x == m * 2**e exactly. If x is zero, returns (0.0, 0), otherwise 0.5 <=
abs(m) < 1. This is used to “pick apart” the internal representation of a float in a
portable way.
math.fsum(iterable)
Return an accurate floating point sum of values in the iterable. Avoids loss of precision
by tracking multiple intermediate partial sums:
>>> sum([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1])
0.9999999999999999
>>> fsum([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1])
1.0
The algorithm’s accuracy depends on IEEE-754 arithmetic guarantees and the typical
case where the rounding mode is half-even. On some non-Windows builds, the
underlying C library uses extended precision addition and may occasionally double-round
an intermediate sum causing it to be off in its least significant bit.
For further discussion and two alternative approaches, see the ASPN cookbook recipes
for accurate floating point summation.
New in version 2.6.
math.isinf(x)
Check if the float x is positive or negative infinity.
New in version 2.6.
math.isnan(x)
Check if the float x is a NaN (not a number). For more information on NaNs, see the
IEEE 754 standards.
New in version 2.6.
math.ldexp(x, i)
Return x * (2**i). This is essentially the inverse of function frexp().
math.modf(x)
Return the fractional and integer parts of x. Both results carry the sign of x and are floats.
math.trunc(x)
Return the Real value x truncated to an Integral (usually a long integer). Uses the
__trunc__ method.
New in version 2.6.
Note that frexp() and modf() have a different call/return pattern than their C equivalents: they
take a single argument and return a pair of values, rather than returning their second return value
through an ‘output parameter’ (there is no such thing in Python).
For the ceil(), floor(), and modf() functions, note that all floating-point numbers of
sufficiently large magnitude are exact integers. Python floats typically carry no more than 53 bits
of precision (the same as the platform C double type), in which case any float x with abs(x) >=
2**52 necessarily has no fractional bits.
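A quick illustration (standard CPython results):
>>> import math
>>> math.modf(2.5)
(0.5, 2.0)
>>> math.modf(2.0 ** 53)     # a float this large has no fractional part
(0.0, 9007199254740992.0)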
math.exp(x)
Return e**x.
math.expm1(x)
Return e**x - 1. For small floats x, the subtraction in exp(x) - 1 can result in a
significant loss of precision; the expm1() function provides a way to compute this
quantity to full precision:
>>> from math import exp, expm1
>>> exp(1e-5) - 1 # gives result accurate to 11 places
1.0000050000069649e-05
>>> expm1(1e-5) # result accurate to full precision
1.0000050000166668e-05
New in version 2.7.
math.log(x[, base])
With one argument, return the natural logarithm of x (to base e).
With two arguments, return the logarithm of x to the given base, calculated as
log(x)/log(base).
Changed in version 2.3: base argument added.
math.log1p(x)
Return the natural logarithm of 1+x (base e). The result is calculated in a way which is
accurate for x near zero.
New in version 2.6.
math.log10(x)
Return the base-10 logarithm of x. This is usually more accurate than log(x, 10).
math.pow(x, y)
Return x raised to the power y. Exceptional cases follow Annex ‘F’ of the C99 standard
as far as possible. In particular, pow(1.0, x) and pow(x, 0.0) always return 1.0, even
when x is a zero or a NaN. If both x and y are finite, x is negative, and y is not an integer
then pow(x, y) is undefined, and raises ValueError.
Unlike the built-in ** operator, math.pow() converts both its arguments to type float.
Use ** or the built-in pow() function for computing exact integer powers.
Changed in version 2.6: The outcome of 1**nan and nan**0 was undefined.
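A small check of the difference (standard CPython results):
>>> import math
>>> math.pow(2, 3)    # arguments are converted to float
8.0
>>> 2 ** 3            # the built-in operator keeps integers exact
8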
math.sqrt(x)
Return the square root of x.
math.atan2(y, x)
Return atan(y / x), in radians. The result is between -pi and pi. The vector in the
plane from the origin to point (x, y) makes this angle with the positive X axis. The
point of atan2() is that the signs of both inputs are known to it, so it can compute the
correct quadrant for the angle. For example, atan(1) and atan2(1, 1) are both pi/4,
but atan2(-1, -1) is -3*pi/4.
math.cos(x)
Return the cosine of x radians.
math.hypot(x, y)
Return the Euclidean norm, sqrt(x*x + y*y). This is the length of the vector from the
origin to point (x, y).
math.sin(x)
Return the sine of x radians.
math.lgamma(x)
Return the natural logarithm of the absolute value of the Gamma function at x.
New in version 2.7.
Constants
math.pi
The mathematical constant π = 3.141592..., to available precision.
The Basics¶
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements
(usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy
dimensions are called axes. The number of axes is rank.
For example, the coordinates of a point in 3D space [1, 2, 1] is an array of rank 1, because it has
one axis. That axis has a length of 3. In the example pictured below, the array has rank 2 (it is 2-
dimensional). The first dimension (axis) has a length of 2, the second dimension has a length of
3.
[[ 1., 0., 0.],
[ 0., 1., 2.]]
NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array
is not the same as the Standard Python Library class array.array, which only handles one-
dimensional arrays and offers less functionality. The more important attributes of an ndarray
object are:
ndarray.ndim
the number of axes (dimensions) of the array. In the Python world, the number of
dimensions is referred to as rank.
ndarray.shape
the dimensions of the array. This is a tuple of integers indicating the size of the array in
each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length
of the shape tuple is therefore the rank, or number of dimensions, ndim.
ndarray.size
the total number of elements of the array. This is equal to the product of the elements of
shape.
ndarray.dtype
an object describing the type of the elements in the array. One can create or specify
dtype’s using standard Python types. Additionally NumPy provides types of its own.
numpy.int32, numpy.int16, and numpy.float64 are some examples.
ndarray.itemsize
the size in bytes of each element of the array. For example, an array of elements of type
float64 has itemsize 8 (=64/8), while one of type complex32 has itemsize 4 (=32/8). It is
equivalent to ndarray.dtype.itemsize.
ndarray.data
the buffer containing the actual elements of the array. Normally, we won’t need to use
this attribute because we will access the elements in an array using indexing facilities.
An example
>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
>>> a.dtype.name
'int64'
>>> a.itemsize
8
>>> a.size
15
>>> type(a)
<type 'numpy.ndarray'>
>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])
>>> type(b)
<type 'numpy.ndarray'>
Array Creation
There are several ways to create arrays.
For example, you can create an array from a regular Python list or tuple using the array function.
The type of the resulting array is deduced from the type of the elements in the sequences.
>>> import numpy as np
>>> a = np.array([2,3,4])
>>> a
array([2, 3, 4])
>>> a.dtype
dtype('int64')
>>> b = np.array([1.2, 3.5, 5.1])
>>> b.dtype
dtype('float64')
A frequent error consists in calling array with multiple numeric arguments, rather than providing
a single list of numbers as an argument.
>>> a = np.array(1,2,3,4) # WRONG
>>> a = np.array([1,2,3,4]) # RIGHT
array transforms sequences of sequences into two-dimensional arrays, sequences of sequences of
sequences into three-dimensional arrays, and so on.
>>> b = np.array([(1.5,2,3), (4,5,6)])
>>> b
array([[ 1.5, 2. , 3. ],
[ 4. , 5. , 6. ]])
The type of the array can also be explicitly specified at creation time:
>>> c = np.array( [ [1,2], [3,4] ], dtype=complex )
>>> c
array([[ 1.+0.j, 2.+0.j],
[ 3.+0.j, 4.+0.j]])
Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy
offers several functions to create arrays with initial placeholder content. These minimize the
necessity of growing arrays, an expensive operation.
The function zeros creates an array full of zeros, the function ones creates an array full of ones,
and the function empty creates an array whose initial content is random and depends on the state
of the memory. By default, the dtype of the created array is float64.
>>> np.zeros( (3,4) )
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
>>> np.ones( (2,3,4), dtype=np.int16 )         # dtype can also be specified
array([[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]],
[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]]], dtype=int16)
>>> np.empty( (2,3) )                          # uninitialized, output may vary
array([[ 3.73603959e-262, 6.02658058e-154, 6.55490914e-260],
[ 5.30498948e-313, 3.14673309e-307, 1.00000000e+000]])
To create sequences of numbers, NumPy provides a function analogous to range that returns
arrays instead of lists.
>>> np.arange( 10, 30, 5 )
array([10, 15, 20, 25])
>>> np.arange( 0, 2, 0.3 ) # it accepts float arguments
array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])
When arange is used with floating point arguments, it is generally not possible to predict the
number of elements obtained, due to the finite floating point precision. For this reason, it is
usually better to use the function linspace that receives as an argument the number of elements
that we want, instead of the step:
>>> from numpy import pi
>>> np.linspace( 0, 2, 9 ) # 9 numbers from 0 to 2
array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. ])
>>> x = np.linspace( 0, 2*pi, 100 )        # useful to evaluate function at lots of points
>>> f = np.sin(x)
See also
array, zeros, zeros_like, ones, ones_like, empty, empty_like, arange, linspace,
numpy.random.rand, numpy.random.randn, fromfunction, fromfile
Printing Arrays
When you print an array, NumPy displays it in a similar way to nested lists, but with the
following layout:
• the last axis is printed from left to right,
• the second-to-last is printed from top to bottom,
• the rest are also printed from top to bottom, with each slice separated from the next by an
empty line.
One-dimensional arrays are then printed as rows, bidimensionals as matrices and tridimensionals
as lists of matrices.
>>> a = np.arange(6) # 1d array
>>> print(a)
[0 1 2 3 4 5]
>>>
>>> b = np.arange(12).reshape(4,3) # 2d array
>>> print(b)
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
>>>
>>> c = np.arange(24).reshape(2,3,4) # 3d array
>>> print(c)
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
See below to get more details on reshape.
If an array is too large to be printed, NumPy automatically skips the central part of the array and
only prints the corners:
>>> print(np.arange(10000))
[ 0 1 2 ..., 9997 9998 9999]
>>>
>>> print(np.arange(10000).reshape(100,100))
[[ 0 1 2 ..., 97 98 99]
[ 100 101 102 ..., 197 198 199]
[ 200 201 202 ..., 297 298 299]
...,
[9700 9701 9702 ..., 9797 9798 9799]
[9800 9801 9802 ..., 9897 9898 9899]
[9900 9901 9902 ..., 9997 9998 9999]]
To disable this behaviour and force NumPy to print the entire array, you can change the printing
options using set_printoptions.
>>> np.set_printoptions(threshold=np.nan)
Basic Operations
Arithmetic operators on arrays apply elementwise. A new array is created and filled with the
result.
>>> a = np.array( [20,30,40,50] )
>>> b = np.arange( 4 )
>>> b
array([0, 1, 2, 3])
>>> c = a-b
>>> c
array([20, 29, 38, 47])
>>> b**2
array([0, 1, 4, 9])
>>> 10*np.sin(a)
array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854])
>>> a<35
array([ True, True, False, False], dtype=bool)
Unlike in many matrix languages, the product operator * operates elementwise in NumPy arrays.
The matrix product can be performed using the dot function or method:
>>> A = np.array( [[1,1],
... [0,1]] )
>>> B = np.array( [[2,0],
... [3,4]] )
>>> A*B # elementwise product
array([[2, 0],
[0, 4]])
>>> A.dot(B) # matrix product
array([[5, 4],
[3, 4]])
>>> np.dot(A, B) # another matrix product
array([[5, 4],
[3, 4]])
Some operations, such as += and *=, act in place to modify an existing array rather than create a
new one.
>>> a = np.ones((2,3), dtype=int)
>>> b = np.random.random((2,3))
>>> a *= 3
>>> a
array([[3, 3, 3],
[3, 3, 3]])
>>> b += a
>>> b
array([[ 3.417022 , 3.72032449, 3.00011437],
[ 3.30233257, 3.14675589, 3.09233859]])
>>> a += b                  # b is not automatically converted to integer type
Traceback (most recent call last):
...
TypeError: Cannot cast ufunc add output from dtype('float64') to
dtype('int64') with casting rule 'same_kind'
When operating with arrays of different types, the type of the resulting array corresponds to the
more general or precise one (a behavior known as upcasting).
>>> a = np.ones(3, dtype=np.int32)
>>> b = np.linspace(0,pi,3)
>>> b.dtype.name
'float64'
>>> c = a+b
>>> c
array([ 1. , 2.57079633, 4.14159265])
>>> c.dtype.name
'float64'
>>> d = np.exp(c*1j)
>>> d
array([ 0.54030231+0.84147098j, -0.84147098+0.54030231j,
-0.54030231-0.84147098j])
>>> d.dtype.name
'complex128'
Many unary operations, such as computing the sum of all the elements in the array, are
implemented as methods of the ndarray class.
>>> a = np.random.random((2,3))
>>> a
array([[ 0.18626021, 0.34556073, 0.39676747],
[ 0.53881673, 0.41919451, 0.6852195 ]])
>>> a.sum()
2.5718191614547998
>>> a.min()
0.1862602113776709
>>> a.max()
0.6852195003967595
By default, these operations apply to the array as though it were a list of numbers, regardless of
its shape. However, by specifying the axis parameter you can apply an operation along the
specified axis of an array:
>>> b = np.arange(12).reshape(3,4)
>>> b
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>>
>>> b.sum(axis=0) # sum of each column
array([12, 15, 18, 21])
>>>
>>> b.min(axis=1) # min of each row
array([0, 4, 8])
>>>
>>> b.cumsum(axis=1) # cumulative sum along each row
array([[ 0, 1, 3, 6],
[ 4, 9, 15, 22],
[ 8, 17, 27, 38]])
Universal Functions
NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are
called “universal functions”(ufunc). Within NumPy, these functions operate elementwise on an
array, producing an array as output.
>>> B = np.arange(3)
>>> B
array([0, 1, 2])
>>> np.exp(B)
array([ 1. , 2.71828183, 7.3890561 ])
>>> np.sqrt(B)
array([ 0. , 1. , 1.41421356])
>>> C = np.array([2., -1., 4.])
>>> np.add(B, C)
array([ 2., 0., 6.])
See also
all, any, apply_along_axis, argmax, argmin, argsort, average, bincount, ceil, clip, conj, corrcoef,
cov, cross, cumprod, cumsum, diff, dot, floor, inner, inv, lexsort, max, maximum, mean, median,
min, minimum, nonzero, outer, prod, re, round, sort, std, sum, trace, transpose, var, vdot,
vectorize, where
Indexing, Slicing and Iterating
One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other
Python sequences.
>>> a = np.arange(10)**3
>>> a
array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729])
>>> a[2]
8
>>> a[2:5]
array([ 8, 27, 64])
>>> a[:6:2] = -1000    # equivalent to a[0:6:2] = -1000; from start to position 6, exclusive, set every 2nd element to -1000
>>> a
array([-1000, 1, -1000, 27, -1000, 125, 216, 343, 512, 729])
>>> a[ : :-1] # reversed a
array([ 729, 512, 343, 216, 125, -1000, 27, -1000, 1, -1000])
>>> for i in a:
... print(i**(1/3.))
...
nan
1.0
nan
3.0
nan
5.0
6.0
7.0
8.0
9.0
Multidimensional arrays can have one index per axis. These indices are given in a tuple
separated by commas:
>>> def f(x,y):
... return 10*x+y
...
>>> b = np.fromfunction(f,(5,4),dtype=int)
>>> b
array([[ 0, 1, 2, 3],
[10, 11, 12, 13],
[20, 21, 22, 23],
[30, 31, 32, 33],
[40, 41, 42, 43]])
>>> b[2,3]
23
>>> b[0:5, 1] # each row in the second column of b
array([ 1, 11, 21, 31, 41])
>>> b[ : ,1] # equivalent to the previous example
array([ 1, 11, 21, 31, 41])
>>> b[1:3, : ]                      # each column in the second and third row of b
array([[10, 11, 12, 13],
[20, 21, 22, 23]])
When fewer indices are provided than the number of axes, the missing indices are considered
complete slices:
>>> b[-1]                                  # the last row. Equivalent to b[-1,:]
array([40, 41, 42, 43])
The expression within brackets in b[i] is treated as an i followed by as many instances of : as
needed to represent the remaining axes. NumPy also allows you to write this using dots as b[i,...].
The dots (...) represent as many colons as needed to produce a complete indexing tuple. For
example, if x is a rank 5 array (i.e., it has 5 axes), then
• x[1,2,...] is equivalent to x[1,2,:,:,:],
• x[...,3] to x[:,:,:,:,3] and
• x[4,...,5,:] to x[4,:,:,5,:].
>>> c = np.array( [[[  0,  1,  2],            # a 3D array (two stacked 2D arrays)
... [ 10, 12, 13]],
... [[100,101,102],
... [110,112,113]]])
>>> c.shape
(2, 2, 3)
>>> c[1,...] # same as c[1,:,:] or c[1]
array([[100, 101, 102],
[110, 112, 113]])
>>> c[...,2] # same as c[:,:,2]
array([[ 2, 13],
[102, 113]])
Iterating over multidimensional arrays is done with respect to the first axis:
>>> for row in b:
... print(row)
...
[0 1 2 3]
[10 11 12 13]
[20 21 22 23]
[30 31 32 33]
[40 41 42 43]
However, if one wants to perform an operation on each element in the array, one can use the flat
attribute which is an iterator over all the elements of the array:
>>> for element in b.flat:
... print(element)
...
0
1
2
3
10
11
12
13
20
21
22
23
30
31
32
33
40
41
42
43
See also
Indexing, Indexing (reference), newaxis, ndenumerate, indices
Shape Manipulation
Changing the shape of an array
An array has a shape given by the number of elements along each axis:
>>> a = np.floor(10*np.random.random((3,4)))
>>> a
array([[ 2., 8., 0., 6.],
[ 4., 5., 1., 1.],
[ 8., 9., 3., 6.]])
>>> a.shape
(3, 4)
The shape of an array can be changed with various commands. Note that the following three
commands all return a modified array, but do not change the original array:
>>> a.ravel() # returns the array, flattened
array([ 2., 8., 0., 6., 4., 5., 1., 1., 8., 9., 3., 6.])
>>> a.reshape(6,2) # returns the array with a modified shape
array([[ 2., 8.],
[ 0., 6.],
[ 4., 5.],
[ 1., 1.],
[ 8., 9.],
[ 3., 6.]])
>>> a.T # returns the array, transposed
array([[ 2., 4., 8.],
[ 8., 5., 9.],
[ 0., 1., 3.],
[ 6., 1., 6.]])
>>> a.T.shape
(4, 3)
>>> a.shape
(3, 4)
The order of the elements in the array resulting from ravel() is normally “C-style”, that is, the
rightmost index “changes the fastest”, so the element after a[0,0] is a[0,1]. If the array is
reshaped to some other shape, again the array is treated as “C-style”. NumPy normally creates
arrays stored in this order, so ravel() will usually not need to copy its argument, but if the array
was made by taking slices of another array or created with unusual options, it may need to be
copied. The functions ravel() and reshape() can also be instructed, using an optional argument, to
use FORTRAN-style arrays, in which the leftmost index changes the fastest.
The reshape function returns its argument with a modified shape, whereas the ndarray.resize
method modifies the array itself:
>>> a
array([[ 2., 8., 0., 6.],
[ 4., 5., 1., 1.],
[ 8., 9., 3., 6.]])
>>> a.resize((2,6))
>>> a
array([[ 2., 8., 0., 6., 4., 5.],
[ 1., 1., 8., 9., 3., 6.]])
If a dimension is given as -1 in a reshaping operation, the other dimensions are automatically
calculated:
>>> a.reshape(3,-1)
array([[ 2., 8., 0., 6.],
[ 4., 5., 1., 1.],
[ 8., 9., 3., 6.]])
See also
ndarray.shape, reshape, resize, ravel
Stacking together different arrays
Several arrays can be stacked together along different axes:
>>> a = np.floor(10*np.random.random((2,2)))
>>> a
array([[ 8., 8.],
[ 0., 0.]])
>>> b = np.floor(10*np.random.random((2,2)))
>>> b
array([[ 1., 8.],
[ 0., 4.]])
>>> np.vstack((a,b))
array([[ 8., 8.],
[ 0., 0.],
[ 1., 8.],
[ 0., 4.]])
>>> np.hstack((a,b))
array([[ 8., 8., 1., 8.],
[ 0., 0., 0., 4.]])
The function column_stack stacks 1D arrays as columns into a 2D array. It is equivalent to
hstack only for 2D arrays:
>>> from numpy import newaxis
>>> np.column_stack((a,b)) # with 2D arrays
array([[ 8., 8., 1., 8.],
[ 0., 0., 0., 4.]])
>>> a = np.array([4.,2.])
>>> b = np.array([3.,8.])
>>> np.column_stack((a,b)) # returns a 2D array
array([[ 4., 3.],
[ 2., 8.]])
>>> np.hstack((a,b)) # the result is different
array([ 4., 2., 3., 8.])
>>> a[:,newaxis] # this allows to have a 2D columns vector
array([[ 4.],
[ 2.]])
>>> np.column_stack((a[:,newaxis],b[:,newaxis]))
array([[ 4., 3.],
[ 2., 8.]])
>>> np.hstack((a[:,newaxis],b[:,newaxis])) # the result is the same
array([[ 4., 3.],
[ 2., 8.]])
On the other hand, the function row_stack is equivalent to vstack for any input arrays. In general,
for arrays with more than two dimensions, hstack stacks along their second axes, vstack stacks
along their first axes, and concatenate allows for an optional argument giving the number of the
axis along which the concatenation should happen.
Note
In complex cases, r_ and c_ are useful for creating arrays by stacking numbers along one axis.
They allow the use of range literals (”:”)
>>> np.r_[1:4,0,4]
array([1, 2, 3, 0, 4])
When used with arrays as arguments, r_ and c_ are similar to vstack and hstack in their default
behavior, but allow for an optional argument giving the number of the axis along which to
concatenate.
See also
hstack, vstack, column_stack, concatenate, c_, r_
Splitting one array into several smaller ones
Using hsplit, you can split an array along its horizontal axis, either by specifying the number of
equally shaped arrays to return, or by specifying the columns after which the division should
occur:
>>> a = np.floor(10*np.random.random((2,12)))
>>> a
array([[ 9., 5., 6., 3., 6., 8., 0., 7., 9., 7., 2., 7.],
[ 1., 4., 9., 2., 2., 1., 0., 6., 2., 2., 4., 0.]])
>>> np.hsplit(a,3) # Split a into 3
[array([[ 9., 5., 6., 3.],
[ 1., 4., 9., 2.]]), array([[ 6., 8., 0., 7.],
[ 2., 1., 0., 6.]]), array([[ 9., 7., 2., 7.],
[ 2., 2., 4., 0.]])]
>>> np.hsplit(a,(3,4)) # Split a after the third and the fourth column
[array([[ 9., 5., 6.],
[ 1., 4., 9.]]), array([[ 3.],
[ 2.]]), array([[ 6., 8., 0., 7., 9., 7., 2., 7.],
[ 2., 1., 0., 6., 2., 2., 4., 0.]])]
vsplit splits along the vertical axis, and array_split allows one to specify along which axis to
split.
Numpy VS SciPy
Numpy:
• NumPy is written in C and is used for mathematical and numeric calculations.
• It is faster than pure-Python alternatives for numerical work.
• NumPy is the most useful library for Data Science for performing basic calculations.
• NumPy's core is the array data type, which supports the most basic operations like
sorting, reshaping, indexing, etc.
SciPy:
• SciPy is built on top of NumPy.
• SciPy offers a fully featured linear algebra module, while NumPy contains only a few
such features.
• Most new Data Science features land in SciPy rather than NumPy.
DESCRIPTION
========================================
Special functions (:mod:`scipy.special`)
========================================
.. module:: scipy.special
Nearly all of the functions below are universal functions and follow
broadcasting and automatic array-looping rules. Exceptions are noted.
Bessel Function
Bessel function of the first kind of integer order n.
Syntax:
scipy.special.jn()
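A minimal usage sketch (assuming SciPy is installed; the order and evaluation point below are
only illustrative values):
from scipy import special
# Bessel function of the first kind, order 2, evaluated at x = 1.5
print(special.jn(2, 1.5))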
You can see this: the frequency is 5 Hz, and the signal repeats every 1/5 of a second; this
repetition interval is called the time period.
Now let us use this sinusoid wave with the help of DFT application.
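The code that builds the 5 Hz sinusoid is assumed here; a sketch consistent with the variables
used below (the sampling rate fre_samp = 50 and the 2-second duration are assumptions):
import numpy as np
import matplotlib.pyplot as plt

fre = 5          # signal frequency in Hz
fre_samp = 50    # sampling rate in samples per second (assumed)
t = np.linspace(0, 2, 2 * fre_samp, endpoint=False)
a = np.sin(fre * 2 * np.pi * t)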
from scipy import fftpack
A = fftpack.fft(a)
frequency = fftpack.fftfreq(len(a)) * fre_samp
figure, axis = plt.subplots()
axis.stem(frequency, np.abs(A))
axis.set_xlabel('Frequency in Hz')
axis.set_ylabel('Frequency Spectrum Magnitude')
axis.set_xlim(-fre_samp / 2, fre_samp/ 2)
axis.set_ylim(-5, 110)
plt.show()
Output:
• You can clearly see that the output is a one-dimensional array.
• The output contains complex values which are zero except at two points.
• In the DFT example, we visualize the magnitude of the signal.
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

def function(a):
    return a*2 + 20 * np.sin(a)

a = np.arange(-10, 10, 0.1)   # assumed range for the plot
plt.plot(a, function(a))
plt.show()
# use BFGS algorithm for optimization, starting from x = 0
optimize.fmin_bfgs(function, 0)
Output:
Optimization terminated successfully.
Current function value: -23.241676
Iterations: 4
Function evaluations: 18
Gradient evaluations: 6
array([-1.67096375])
• In this example, optimization is done with the help of a gradient-based algorithm from the
initial point.
• A possible issue, however, is finding a local minimum instead of the global minimum. If the
starting point is not in the neighborhood of the global minimum, we need to apply global
optimization; the function used to find the global minimum is basinhopping(), which
combines a local optimizer with repeated random starts.
optimize.basinhopping(function, 0)
Output:
fun: -23.241676238045315
lowest_optimization_result:
fun: -23.241676238045315
hess_inv: array([[0.05023331]])
jac: array([4.76837158e-07])
message: 'Optimization terminated successfully.'
nfev: 15
nit: 3
njev: 5
status: 0
success: True
x: array([-1.67096375])
message: ['requested number of basinhopping iterations
completed successfully']
minimization_failures: 0
nfev: 1530
nit: 100
njev: 510
x: array([-1.67096375])
Summary
• SciPy (pronounced "Sigh Pie") is an open-source Python-based library, which is used in
mathematics, scientific computing, engineering, and technical computing.
• SciPy contains a variety of sub-packages which help to solve the most common issues
related to scientific computation.
• SciPy is built on top of NumPy.
Package Name        Description
scipy.io            File input/output
scipy.special       Special functions
scipy.linalg        Linear algebra operations
scipy.interpolate   Interpolation
scipy.optimize      Optimization and fit
scipy.stats         Statistics and random numbers
scipy.integrate     Numerical integration
scipy.fftpack       Fast Fourier transforms
scipy.signal        Signal processing
scipy.ndimage       Image manipulation
As you can see, it is a challenging task: after all, the images are of poor resolution. Do you agree
with the classifier?
A complete example of this classification problem is available as an example that you can run
and study: Recognizing hand-written digits.
Model persistence
It is possible to save a model in scikit-learn by using Python’s built-in persistence model, pickle:
>>>
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC(gamma='scale')
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
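The pickling step itself is not spelled out above; a minimal sketch following the same tutorial is:
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])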
Conventions
scikit-learn estimators follow certain rules to make their behavior more predictable. These are
described in more detail in the Glossary of Common Terms and API Elements.
Type casting
Unless otherwise specified, input will be cast to float64:
>>>
>>> import numpy as np
>>> from sklearn import random_projection
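A sketch of the casting example and of the fit that produces the first prediction below, following
the scikit-learn tutorial's conventions (the random data here is only illustrative):
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
Regression targets are cast to float64, while classification targets are maintained:
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC(gamma='scale')
>>> clf.fit(iris.data, iris.target)
SVC(...)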
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
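and, after refitting on the string labels (a sketch of the step the explanation below refers to):
>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC(...)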
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
Here, the first predict() returns an integer array, since iris.target (an integer array) was
used in fit. The second predict() returns a string array, since iris.target_names was for
fitting.
Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the
set_params() method. Calling fit() more than once will overwrite what was learned by any
previous fit():
>>>
>>> import numpy as np
>>> from sklearn.svm import SVC
>>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
>>> y = [0, 0, 1, 1, 2]
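A sketch of how the example continues (the hyper-parameter values are illustrative): the
estimator is fit once, then set_params() changes a hyper-parameter and a second fit() overwrites
what the first fit learned.
>>> clf = SVC()
>>> clf.set_params(kernel='linear').fit(X, y)
SVC(kernel='linear', ...)
>>> clf.set_params(kernel='rbf', gamma='scale').fit(X, y)   # refit; the previous fit is forgotten
SVC(gamma='scale', ...)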
History of NLP
Here are important events in the history of Natural Language Processing:
1950 - NLP started when Alan Turing published an article called "Computing Machinery and Intelligence."
1950 - Attempts to automate translation between Russian and English
1960 - The work of Chomsky and others on formal language theory and generative syntax
1990 - Probabilistic and data-driven models had become quite standard
2000 - A large amount of spoken and textual data became available
The answer is that we learn these things through experience. The main question, however, is how
a computer can come to know the same.
We need to provide enough data for Machines to learn through experience. We can feed details
like
• Her Majesty the Queen.
• The Queen's speech during the State visit
• The crown of Queen Elizabeth
• The Queen's Mother
• The queen is generous.
With the above examples, the machine understands the entity Queen.
The machine creates word vectors as below. A word vector is built using surrounding words.
Components of NLP
Five main Component of Natural Language processing are:
• Morphological and Lexical Analysis
• Syntactic Analysis
• Semantic Analysis
• Discourse Integration
• Pragmatic Analysis
Morphological and Lexical Analysis
Lexical analysis deals with a language's vocabulary: its words and expressions. It involves
analyzing, identifying, and describing the structure of words, and includes dividing a text into
paragraphs, words, and sentences.
Individual words are analyzed into their components, and nonword tokens such as punctuation
are separated from the words.
Semantic Analysis
Semantic analysis assigns meanings to the structures created by the syntactic analyzer. This
component transfers linear sequences of words into structures and shows how the words are
associated with each other.
Semantics focuses only on the literal meaning of words, phrases, and sentences. It abstracts only
the dictionary meaning, or the real meaning, from the given context. The structures assigned by
the syntactic analyzer always have assigned meaning.
E.g., "colorless green idea" would be rejected by the semantic analysis, because "colorless green"
doesn't make any sense.
Pragmatic Analysis
Pragmatic analysis deals with the overall communicative and social content and its effect on
interpretation. It means abstracting or deriving the meaningful use of language in situations. In
this analysis, the main focus is always on what was said being reinterpreted as what is actually
meant. Pragmatic analysis helps users to discover this intended effect by applying a set of rules
that characterize cooperative dialogues.
E.g., "close the window?" should be interpreted as a request instead of an order.
Syntax analysis
Words are commonly accepted as being the smallest units of syntax. Syntax refers to the
principles and rules that govern the sentence structure of any individual language.
Syntax focuses on the proper ordering of words, which can affect meaning. This involves
analyzing the words in a sentence according to the grammatical structure of the sentence. The
words are transformed into a structure that shows how the words are related to each other.
Discourse Integration
Discourse integration means making sense of context. The meaning of any single sentence
depends upon the sentences that precede it, and it may also influence the meaning of the sentence
that follows.
For example, the word "that" in the sentence "He wanted that" depends upon the prior discourse
context.
NLP Examples
Today, natural language processing technology is widely used.
Here are common applications of NLP:
Information retrieval & Web Search
Google, Yahoo, Bing, and other search engines base their machine translation technology on
NLP deep learning models. It allows algorithms to read text on a webpage, interpret its meaning
and translate it to another language.
Grammar Correction:
NLP technique is widely used by word processor software like MS-word for spelling correction
& grammar checking.
Question Answering
Type in keywords to ask Questions in Natural Language.
Text Summarization
The process of summarising important information from a source to produce a shortened version
Machine Translation
Use of computer applications to translate text or speech from one natural language to another.
Sentiment analysis
NLP helps companies analyze large numbers of reviews of a product. It also makes it easier for their
customers to give a review of a particular product.
Future of NLP
• Human-level natural language processing is the biggest AI problem. It is almost the
same as solving the central artificial intelligence problem and making computers as
intelligent as people.
• With the help of NLP, future computers or machines will be able to learn from the
information available online and apply it in the real world, although a lot of work still
needs to be done in this regard.
• Combined with natural language generation, computers will become more capable of
receiving and giving useful and resourceful information or data.
Advantages of NLP
• Users can ask questions about any subject and get a direct response within seconds.
• The NLP system provides answers to questions in natural language
• The NLP system offers exact answers to questions, with no unnecessary or unwanted
information
• The accuracy of the answers increases with the amount of relevant information provided
in the question.
• NLP helps computers communicate with humans in their own language and scales
other language-related tasks
• Allows you to process more language-based data than a human being could, without
fatigue and in an unbiased and consistent way.
• Structuring a highly unstructured data source
Disadvantages of NLP
• Complex query language - the system may not be able to provide the correct answer if
the question is poorly worded or ambiguous.
• The system is built for a single and specific task only; it is unable to adapt to new
domains and problems because of limited functions.
• An NLP system doesn't have a user interface with features that allow users to further
interact with the system
Summary
• Natural Language Processing is a branch of AI which helps computers to understand,
interpret and manipulate human language
• NLP started when Alan Turing published an article called "Computing Machinery and Intelligence".
• NLP never focuses on voice modulation; it does draw on contextual patterns
• Five essential components of Natural Language processing are 1) Morphological and
Lexical Analysis 2)Syntactic Analysis 3) Semantic Analysis 4) Discourse Integration 5)
Pragmatic Analysis
• The three types of natural language writing systems are 1) Logographic 2) Syllabic 3)
Alphabetic
• Machine learning and statistical inference are two methods used to implement Natural
Language Processing
• Essential applications of NLP are Information Retrieval & Web Search, Grammar
Correction, Question Answering, Text Summarization, Machine Translation, etc.
• With the help of NLP, future computers or machines will be able to learn from the
information available online and apply it in the real world, although a lot of work still
needs to be done in this regard
• Natural language is ambiguous, while computer languages are designed to be unambiguous
• The biggest advantage of the NLP system is that it offers exact answers to the questions,
no unnecessary or unwanted information
• The biggest drawback of an NLP system is that it is built for a single, specific task only, so it
is unable to adapt to new domains and problems because of its limited functions
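Tokenization of Words
Below is a minimal sketch of word tokenization with NLTK's word_tokenize, consistent with the output and explanation that follow (it assumes the punkt tokenizer data has been downloaded with nltk.download('punkt')):

from nltk.tokenize import word_tokenize

text = "God is Great! I won a lottery."
print(word_tokenize(text))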
Output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
Code Explanation
• word_tokenize module is imported from the NLTK library.
• A variable "text" is initialized with two sentences.
• The text variable is passed to the word_tokenize module and the result is printed. This module
splits the text into words and separates out punctuation, which you can see in the output.
Tokenization of Sentences
A sub-module available for the above is sent_tokenize. An obvious question in your mind would
be why sentence tokenization is needed when we already have word tokenization.
Imagine you need to count the average number of words per sentence; how will you calculate it? To
accomplish such a task, you need both sentence tokenization and word tokenization to calculate the
ratio. Such output serves as an important feature for machine training because the answer would be
numeric.
Check the example below to learn how sentence tokenization differs from word tokenization.
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))
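Running this should print the two sentences as separate items: ['God is Great!', 'I won a lottery.']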
POS Tagging
Parts-of-speech tagging is responsible for reading text in a language and assigning a
specific tag (part of speech) to each word.
e.g.
Input: Everything to permit us.
Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
Steps Involved:
• Tokenize text (word_tokenize)
• Apply pos_tag to the output of the above step, i.e. nltk.pos_tag(tokenize_text)
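A minimal sketch of these two steps (it assumes the punkt and averaged_perceptron_tagger data packages have been downloaded):

import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time downloads

text = "Everything to permit us."
tokens = nltk.word_tokenize(text)  # step 1: tokenize the text
print(nltk.pos_tag(tokens))        # step 2: tag each token with its part of speech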
Some examples are as below:
Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
JJ adjective (large)
LS list marker
RP particle (about)
UH interjection (goodbye)
VB verb (ask)
A POS tagger is used to assign grammatical information to each word of the sentence. With this,
installing, importing and downloading all the packages of NLTK is complete.
Chunking
Chunking is used to add more structure to the sentence on top of parts-of-speech (POS)
tagging. It is also known as shallow parsing. The resulting groups of words are called "chunks." In
shallow parsing, there is at most one level between roots and leaves, while deep parsing
comprises more than one level. Shallow parsing is also called light parsing or chunking.
The primary usage of chunking is to make a group of "noun phrases." The parts of speech are
combined with regular expressions.
Rules for Chunking:
There are no pre-defined rules; you combine them according to your needs and requirements.
For example, suppose you need to tag nouns, verbs (past tense), adjectives, and coordinating
conjunctions from a sentence. You can use the rule below:
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
The following table shows what the various symbols mean:
Name of symbol Description
? Match 0 or 1 repetitions
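A minimal sketch of how such a rule can be applied with nltk.RegexpParser (the sample sentence here is only illustrative):

import nltk

sentence = "learn php from guru99 and make study easy"  # illustrative sentence, an assumption
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

grammar = "chunk: {<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"  # the rule shown above
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))  # prints a shallow parse tree with the matched "chunk" groups

Stemming
Stemming reduces the different variations of a word to a common root. A minimal sketch, using an illustrative list of word variations, shows the idea; the code explanation below walks through these steps:

from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]  # a dummy list of variations of the same word
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)  # each variation is reduced to the root "wait"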
Code Explanation:
• There is a stem module in NLTK, which is imported. If you imported the complete module,
the program would become heavy, as it contains thousands of lines of code. So from the
entire stem module, we only imported "PorterStemmer."
• We prepared a dummy list of variation data of the same word.
• An object is created which belongs to class nltk.stem.porter.PorterStemmer.
• Further, we passed each word to PorterStemmer one by one using a "for" loop. Finally, we got the
root word of each word mentioned in the list as output.
From the above explanation, it can also be concluded that stemming is considered an
important preprocessing step because it removes redundancy in the data and variations of the
same word. As a result, the data is filtered, which helps in better machine training.
Now we pass a complete sentence and check for its behavior as an output.
Program:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
sentence = "Hello Guru99, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)
ps = PorterStemmer()
for w in words:
    rootWord = ps.stem(w)
    print(rootWord)
Output:
hello
guru99
,
you
have
build
a
veri
good
site
and
I
love
visit
your
site
Code Explanation
• The package PorterStemmer is imported from the stem module
• Packages for tokenization of sentence as well as words are imported
• A sentence is written which is to be tokenized in the next step.
• Word tokenization is implemented in this step.
• An object for PorterStemmer is created here.
• The loop is run and stemming of each word is done using the object created in code line 5
Conclusion:
Stemming is a data-preprocessing step. The English language has many variations of a single
word. These variations create ambiguity in machine learning training and prediction. To create a
successful model, it's vital to filter such words and convert them to the same kind of sequenced data
using stemming. This is also an important technique for extracting root words from a set of sentences
and removing redundant data, which is also known as normalization.
What is Lemmatization?
Lemmatization is the algorithmic process of finding the lemma of a word depending on its
meaning. Lemmatization usually refers to the morphological analysis of words, which aims to
remove inflectional endings. It helps in returning the base or dictionary form of a word, which is
known as the lemma. The NLTK lemmatization method is based on WordNet's built-in morphy
function. Text preprocessing includes both stemming and lemmatization. Many people find
the two terms confusing; some treat them as the same, but there is a difference between them.
Lemmatization is usually preferred over stemming because it returns real dictionary words rather than truncated stems.
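A minimal sketch of NLTK lemmatization (it assumes the wordnet data package has been downloaded):

from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # one-time download

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("corpora"))          # -> corpus (dictionary form)
print(lemmatizer.lemmatize("better", pos="a"))  # -> good (lemma of the adjective)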
WordNet can also be queried for the synonyms and antonyms of a word. Assuming the lists synonyms and
antonyms have already been collected from the synsets of the word "active", they can be printed as follows:
print(set(synonyms))
print(set(antonyms))
The output of the code:
{'dynamic', 'fighting', 'combat-ready', 'active_voice', 'active_agent', 'participating', 'alive', 'active'}
-- Synonym
{'stative', 'passive', 'quiet', 'passive_voice', 'extinct', 'dormant', 'inactive'} -- Antonym
• To count the tags, you can use the package Counter from the collection's module. A
counter is a dictionary subclass which works on the principle of key-value operation. It is
an unordered collection where elements are stored as a dictionary key while the count is
their value.
• Import nltk which contains modules to tokenize the text.
• Write the text whose pos_tag you want to count.
• Some words are in upper case and some in lower case, so it is appropriate to transform all
the words to lower case before applying tokenization.
• Pass the words through word_tokenize from nltk.
• Calculate the pos_tag of each token
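A minimal sketch of these steps (the sample text is inferred from the output shown below, so treat it as an assumption):

from collections import Counter
import nltk

text = "Guru99 is one of the best site to learn web, sap, ethical hacking and much more online"
lower_case = text.lower()                    # normalize case before tokenizing
tokens = nltk.word_tokenize(lower_case)
tags = nltk.pos_tag(tokens)                  # tag each token with its part of speech
print(tags)

counts = Counter(tag for word, tag in tags)  # count how often each POS tag occurs
print(counts)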
Output = [('guru99', 'NN'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'),
('the', 'DT'), ('best', 'JJS'), ('site', 'NN'), ('to', 'TO'), ('learn',
'VB'), ('web', 'NN'), (',', ','), ('sap', 'NN'), (',', ','),
('ethical', 'JJ'), ('hacking', 'NN'), ('and', 'CC'), ('much', 'RB'),
('more', 'JJR'), ('online', 'JJ')]
• Now comes the role of the Counter, which we imported in code line 1. It counts how many times
each tag occurs in the text, storing the tag as the key and its total count as the value.
Frequency Distribution
Frequency distribution refers to the number of times an outcome of an experiment
occurs. It is used to find the frequency of each word occurring in a document. It uses the
FreqDist class, defined in the nltk.probability module.
A frequency distribution is usually created by counting the samples from repeatedly running the
experiment, incrementing the count by one each time. E.g.
from nltk.probability import FreqDist

freq_dist = FreqDist()
for token in document:      # document is an iterable of tokens
    freq_dist[token] += 1
For any word, we can check how many times it occurred in a particular document. E.g.:
• Count method: freq_dist['and'] returns the number of times 'and' occurred in the
document.
• Frequency method: freq_dist.freq('and') returns the relative frequency of a given
sample.
We will write a small program and will explain its working in detail. We will write some text
and will calculate the frequency distribution of each word in the text.
import nltk

a = "Guru99 is the site where you can find the best tutorials for Software Testing Tutorial, SAP Course for Beginners. Java Tutorial for Beginners and much more. Please visit the site guru99.com and much more."
words = nltk.tokenize.word_tokenize(a)
fd = nltk.FreqDist(words)
fd.plot()
Explanation of code:
• Import the nltk module.
• Write the text whose word distribution you need to find.
• Tokenize each word in the text; the tokens serve as input to the FreqDist module of nltk.
• Apply the list of words to nltk.FreqDist
• Plot the words and their counts in a graph using plot()
Visualize the graph for a better understanding of the text.
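Note that fd.plot() draws the frequency graph with matplotlib, so matplotlib must be installed; if you only need the numbers, print(fd.most_common(10)) shows the ten most frequent tokens and their counts without plotting.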