Python NumPy - Course Introduction
Python has become one of the most widely used tools in data science. With
powerful data science libraries such as NumPy, SciPy, pandas, matplotlib, and
scikit-learn, tools such as the IPython notebook, and a syntax that is easy to
learn, Python is a preferred language for many organizations.
This course introduces you to some of the libraries that are useful for data science.
You will then take a deep dive into NumPy.
Data Science
Data Science is an interdisciplinary field that extracts insights from data present
in multiple forms.
To master Data Science, one needs knowledge of all of the following fields.
Computer Science
Artificial Intelligence and Machine Learning
Statistics and Mathematics
Domain Knowledge
Knowledge Discovery Process
The Knowledge Discovery Process extracts meaningful insights from raw data. It
involves the following series of steps.
Problem Definition
Data Collection
Data Preprocessing
Data Transformation
Data Mining
Data Analysis
Data Visualization
Python provides many powerful libraries that can be used to perform the tasks
described above.
Python Libraries for Data Science
NumPy
An essential library for scientific computing in Python.
Holds data in N-dimensional array (ndarray) objects, which can store data in
multiple dimensions.
Supports efficient array operations through its broadcasting feature.
pandas
Provides functionality for working with structured data.
Stores data in two primary data structures: Series and DataFrame. (A third
structure, Panel, existed in older releases but has been removed from recent
pandas versions.)
matplotlib
Widely used for Data Visualization.
Used to generate various types of plots.
SciPy
A collection of efficient numerical algorithms for tasks such as numerical
integration, signal processing, and optimization.
NLTK
Performs different tasks related to Natural Language Processing.
scikit-learn
A Python library used for machine learning.
Jupyter
Provides a web-based interactive computational environment.
Combines code, rich text, plots, media, and mathematical equations in one document.
Bokeh
Offers interactive Web visualization features.
PyMongo
PyMongo distribution comprises tools for working with MongoDB.
MongoDB is a highly scalable and robust NoSQL database.
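The NumPy entry in the list above mentions broadcasting. As a minimal sketch of what that means, the example below combines arrays of different shapes without writing any loops. The array names and values (prices, discount) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: a 2 x 3 array and a 1-D array of length 3.
prices = np.array([[10.0, 20.0, 30.0],
                   [40.0, 50.0, 60.0]])   # shape (2, 3)
discount = np.array([0.5, 1.0, 2.0])      # shape (3,)

# A scalar is broadcast to every element of the array.
doubled = prices * 2

# The 1-D array is broadcast across each row of the 2-D array.
reduced = prices - discount

print(doubled)
print(reduced)
```

Broadcasting works here because the trailing dimension of prices (3) matches the length of discount; shapes that cannot be aligned this way raise a ValueError.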
Scientific Distributions
Without a scientific distribution, a data scientist has to manually install all the
Python libraries required for the various tasks in the Knowledge Discovery Process.
Drawbacks of Manual Installation
Installing some libraries may require installing other dependencies first.
It is a time-consuming task.
Installation of some libraries may fail.
It is prone to manual errors.
Scientific Distribution
The drawbacks of manual installation can be overcome by using any one of the
available scientific distributions.
A scientific distribution is a collection of Python libraries that provides a
ready-to-use Python environment.
A scientific distribution is easy to download, install, and use.
A few popular distributions are Anaconda, Enthought Python, PythonXY, and WinPython.
In this course, you will learn about Anaconda.
Anaconda
Anaconda is a popular platform used for data science.
The base version is open source and contains more than 100 packages from Python, R,
and Scala.
Additionally, it provides access to more than 700 packages that can be installed and
managed using conda.
Anaconda is available for 32-bit and 64-bit operating systems: Windows, Linux, and
macOS.
Installing Anaconda
Steps for installing Anaconda
Identify your system's OS and its architecture, i.e., 32-bit or 64-bit.
Go to Anaconda's downloads page.
Select the download section of your OS.
Choose the Python version, i.e., 3.x or 2.x, based on your requirements.
Download the installer based on your system architecture.
Optional: Verify data integrity with MD5 or SHA-256.
Install the downloaded file.
Anaconda Navigator
Provides access to various components of Anaconda Distribution.
The following windows are listed on the left side of Anaconda Navigator.
Home
Environment
Projects
Learning
Community
Home and Environment Windows
Home Window
Opens by default with the root environment.
Enables launching a working environment in various modes such as Jupyter Notebook,
Jupyter qt-console, and Spyder IDE.
Environment Window
Shows information about the available environments.
Details of the packages installed in each available environment are viewable.
Projects, Learning, and Community Windows
Projects Window
Provides tools for managing Anaconda projects.
Learning Window
Provides access to popular Data Science Resources.
Community Window
Provides links to popular Data Science Events, Forums, Blogs, etc.
Anaconda Prompt is the command line tool provided by the Anaconda Distribution.
You can access Anaconda's default Python interactive interpreter using the command
'python'.
You can also work with conda, Anaconda's package manager.
Command for checking Conda's version.
conda --version
Command for viewing available environments.
conda info --envs
Creating a new environment
By default, Anaconda comes with a root environment.
A new environment testenv, with Python 2.7, can be created using the below command.
conda create --name testenv python=2.7
Command for activating testenv (recent conda versions use conda activate testenv
instead).
activate testenv
Command for viewing the packages installed in testenv.
conda list
Importing the numpy package from the newly created testenv results in an ImportError,
because the package is not installed there yet.
You can install the package using conda install.
conda install numpy
Now you can verify that numpy is available with the conda list command.
After successful installation, you can import numpy from testenv without any
errors.
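The check described above can also be made from inside the Python interpreter. The following is a minimal sketch: the helper name can_import is hypothetical, and the script only reports which interpreter is running and whether numpy can be imported from it.

```python
import sys


def can_import(module_name):
    """Return True if the module can be imported by the current interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# sys.executable and sys.prefix show which environment's Python is in use.
print("Interpreter: {}".format(sys.executable))
print("Environment prefix: {}".format(sys.prefix))
print("numpy available: {}".format(can_import("numpy")))
```

Run inside the activated testenv, the last line is expected to report False before conda install numpy and True afterwards.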
IPython
IPython provides an interactive working environment that is convenient and
efficient.
Its major components are:
An interactive Python shell.
A Jupyter kernel that allows working with Python code in various interactive front
ends.
Features of IPython
Python statements and system commands can be executed in IPython.
IPython supports Tab completion.
With Magic Methods, IPython makes many common tasks easy to perform.
IPython caches input and output history.
IPython supports parallel computing.
Launching Jupyter qt-console
The GIF illustrates the following:
How to open IPython in Jupyter qt-console from Anaconda Navigator.
How to execute Python statements in IPython.
How to run system commands in IPython.
How to get information about an object or a method.
How to use the Tab completion feature.
Understanding Magic Methods
Magic Methods (called magic commands in the IPython documentation) begin with a
single % symbol or a double %% symbol.
Line Magic Method: A magic method starting with one % symbol.
A Line Magic Method applies only to a single line of code.
Cell Magic Method: A magic method starting with two %% symbols.
A Cell Magic Method applies to multiple lines of code written in a single
cell.
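As an illustration of the line and cell Magic Methods described above, IPython offers %timeit (line) and %%timeit (cell) for timing code. Because magics only work inside IPython, the sketch below shows them in comments and performs a roughly comparable measurement with the standard timeit module, so that it also runs in a plain Python interpreter. The timed statements are arbitrary examples.

```python
import timeit

# Line magic in IPython (applies to one line only):
#   %timeit sum(range(1000))
line_seconds = timeit.timeit("sum(range(1000))", number=1000)

# Cell magic in IPython (applies to the whole cell):
#   %%timeit
#   total = 0
#   for i in range(1000):
#       total += i
cell_code = """
total = 0
for i in range(1000):
    total += i
"""
cell_seconds = timeit.timeit(cell_code, number=1000)

print("Line statement, 1000 runs: {:.4f} s".format(line_seconds))
print("Cell block, 1000 runs: {:.4f} s".format(cell_seconds))
```

Unlike this sketch, %timeit chooses the number of runs automatically and reports summary statistics.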
Starting Jupyter Notebook Server
The Jupyter Notebook server can be launched from the Anaconda Navigator Home Window.
The Notebook server opens in a browser and displays the contents of the starting
folder.
The displayed page contains the following three tabs.
Files displays the folders and files present in the starting folder.
Running holds information about the notebooks that are running.
Clusters contains information about notebooks running in parallel mode.
Creating a Folder
A folder can be created using the Folder option under the New section.
The GIF illustrates the following.
Creating an Untitled folder.
Renaming it to MyJupyterNoteBooks.
Changing the working directory to the MyJupyterNoteBooks folder.
Starting a Jupyter Notebook
A Jupyter Notebook can be created by choosing an available kernel.
The kernel provides the environment required for executing the code snippets.
The GIF illustrates the following.
Creating an Untitled notebook.
Renaming it to MyFirstNoteBook.
Checking its running status in the Files / Running tabs.
Shutting down the notebook MyFirstNoteBook.
About a Notebook Cell
The basic element of a Notebook is the cell.
A user can write either code snippets or Markdown text inside a cell.
Markdown text can be used to embed normal text, header text, unordered and ordered
lists, hyperlinks, tables, images, videos, HTML content, and other useful elements
inside the Notebook.
Markdown Basics
In this section, you will be writing the following elements in Markdown.
Headers: One to six consecutive hash symbols (#) create headers of level 1 to 6.
Emphasizing Text: Asterisks (*) or underscores (_) are used to emphasize text in
italic (single symbol) or bold (double symbol).
Unordered Lists: Any one of the symbols asterisk (*), hyphen (-), or plus (+) is used.
Ordered Lists: Numbers followed by a dot (.) and a space are used.
Nested Unordered Lists: Nested lists are indented with a minimum of four spaces,
followed by the list symbol.
Line breaks within a list element: Two spaces at the end of a line force a line
break, so several lines of text can be kept within one list element.
Code snippets: A pair of three backquotes encloses a code snippet.
Hyperlinks: Text written in a pair of square brackets is linked to the hyperlink
specified in the pair of parentheses that follows.
Reference Links: The text and the reference are written in two separate pairs of
square brackets.
HTML Content: HTML tags can be used directly in Markdown.
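The elements listed above can be seen together in a short sample. The sketch below keeps a hypothetical Markdown cell body in a Python string and prints it; the text and link targets are illustrative only. Pasting the printed text into a Markdown cell renders the headers, emphasis, lists, links, and HTML content.

```python
# A hypothetical Markdown cell body, stored in a Python string for illustration.
markdown_sample = """\
# Level 1 header
## Level 2 header

This is *italic* text and this is **bold** text.

* Unordered item
    * Nested item, indented by four spaces

1. First ordered item
2. Second ordered item

[Jupyter](https://jupyter.org)

[NumPy][1]

[1]: https://numpy.org

<b>HTML content</b>
"""

print(markdown_sample)
```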
Writing Your First Notebook
The first GIF performs the following tasks in the notebook MyFirstNoteBook.
Defines the string 's' with the value Welcome to Jupyter Notebooks!!!.
Displays the string 's'.
Provides the required description.
The second GIF performs the following additional tasks in MyFirstNoteBook.
Determines the length of 's'.
Obtains the slice Jupyter Notebooks from 's'.
Finds the number of vowels in 's'.
Filters the words starting with either 'J' or 'N'.
Provides titles as required.
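The code cells for the tasks listed above could look like the following sketch. Only the string value comes from the notebook description; the variable names and the exact way each task is solved are assumptions.

```python
# Define and display the string.
s = 'Welcome to Jupyter Notebooks!!!'
print(s)

# Length of the string.
length = len(s)

# Slice 'Jupyter Notebooks' out of the string.
start = s.index('Jupyter')
phrase = s[start:start + len('Jupyter Notebooks')]

# Count the vowels (a, e, i, o, u), ignoring case.
vowel_count = sum(1 for ch in s.lower() if ch in 'aeiou')

# Keep only the words that start with 'J' or 'N'.
jn_words = [word for word in s.split() if word.startswith(('J', 'N'))]

print(length)        # 31
print(phrase)        # Jupyter Notebooks
print(vowel_count)   # 10
print(jn_words)      # ['Jupyter', 'Notebooks!!!']
```

Note that splitting on whitespace keeps the trailing exclamation marks attached to the last word.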
NumPy
NumPy is a Python library that supports efficient numerical operations on arrays
holding numeric data.
These arrays are known as N-dimensional arrays or ndarrays.
Ndarrays are capable of holding data elements in multiple dimensions.
Each data element of an ndarray is of fixed size.
All elements of an ndarray are of the same data type.
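The last two properties can be observed directly. In the minimal sketch below (array names are hypothetical), NumPy converts a list of mixed integers and floats to a single data type, and an explicitly chosen 16-bit integer type fixes the size of every element at 2 bytes.

```python
import numpy as np

# Mixed integers and floats are stored with one common data type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)      # float64

# With an explicit 16-bit integer type, every element occupies 2 bytes.
small = np.array([1, 2, 3], dtype=np.int16)
print(small.itemsize)   # 2
print(small.nbytes)     # 6
```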
N-dimensional array (ndarray)
An N-dimensional array is an object capable of holding data elements of the same
type and of a fixed size in multiple dimensions.
Creation of a 1-D array of five elements from a list is shown in Example 1.
Example 1
import numpy as np
x = np.array([5, 8, 9, 10, 11])  # create an ndarray with the 'array' function
type(x)  # returns the type of array 'x'
Output
numpy.ndarray
Creation of a 2-D array from a list of lists is shown in Example 2.
Example 2
y = np.array([[6, 9, 5],
              [10, 82, 34]])
print(y)
Output
[[ 6  9  5]
 [10 82 34]]
ndarray Attributes
Some of the important attributes of an ndarray are
ndim : Number of dimensions.
shape : Shape of the array, as a tuple.
size : Total number of elements.
dtype : Data type of each element.
itemsize : Size of each element in bytes.
nbytes : Total bytes consumed by all elements.
Example 3
print(y.ndim, y.shape, y.size, y.dtype, y.itemsize, y.nbytes)
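Output on a platform where NumPy's default integer type is 32-bit (with a 64-bit
default, the last three values are int64 8 48)
2 (2, 3) 6 int32 4 24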
NumPy dtypes
NumPy supports various data types, which differ in the number of bytes required by
each data element.
The data type can be specified explicitly with the dtype argument.
An ndarray holding float values is defined in Example 4.
Example 4
y = np.array([[6, 9, 5],
              [10, 82, 34]],
             dtype='float64')
print(y)
print(y.dtype)
Output
[[ 6.  9.  5.]
 [10. 82. 34.]]
float64
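import numpy as np

def array_operations(l):
    # Write your code below
    x = np.array(l)                 # convert the input list to an ndarray
    print(type(x))                  # type of the array object
    print(x.ndim, x.shape, x.size)  # dimensions, shape, and element count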
NumPy Array Creation
N-dimensional arrays (ndarrays) can be created in multiple ways in NumPy.
Now let us focus on creating an ndarray in the following ways.
From Python built-in data types: lists or tuples.
Using NumPy array creation methods such as ones, ones_like, zeros, and zeros_like.
Using NumPy numeric sequence generators.
Using the NumPy random module.
By reading data from a file.
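Each of the approaches listed above is sketched briefly below. The variable names and values are hypothetical. The choice of arange and linspace as sequence generators, np.random.rand for random values, and np.loadtxt for file input are assumptions about which functions the course goes on to use. An in-memory text buffer stands in for a real file so that the example is self-contained (Python 3 is assumed).

```python
import io

import numpy as np

# 1. From Python built-in data types.
from_list = np.array([1, 2, 3])
from_tuple = np.array((4.0, 5.0, 6.0))

# 2. Array creation methods.
ones = np.ones((2, 3))                   # 2 x 3 array filled with 1.0
zeros = np.zeros((2, 3))                 # 2 x 3 array filled with 0.0
ones_like = np.ones_like(from_list)      # same shape and dtype as from_list
zeros_like = np.zeros_like(from_tuple)   # same shape and dtype as from_tuple

# 3. Numeric sequence generators.
evens = np.arange(0, 10, 2)              # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)          # 5 evenly spaced values, ends included

# 4. Random module: uniform values in [0, 1).
samples = np.random.rand(2, 2)

# 5. Reading from a file (an in-memory buffer stands in for a file on disk).
csv_text = io.StringIO("1,2,3\n4,5,6\n")
from_file = np.loadtxt(csv_text, delimiter=",")

print(evens)            # [0 2 4 6 8]
print(from_file.shape)  # (2, 3)
```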