
Numpy and Pandas

NumPy is a Python library for efficient array manipulation and numerical computations, created in 2005. It offers an array object called ndarray that is significantly faster than traditional Python lists due to its continuous memory storage. The document also covers creating ndarrays, array dimensions, indexing, and basic operations with NumPy and introduces Pandas and Matplotlib for data analysis and visualization.

Uploaded by

devjit207

What is NumPy?

NumPy is a Python library used for working with arrays. It also has functions for working in
the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis
Oliphant. It is an open source project and you can use it freely. NumPy stands for Numerical
Python.

Why Use NumPy?


In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy
aims to provide an array object that is up to 50x faster than traditional Python lists. The array
object in NumPy is called ndarray. It provides a lot of supporting functions that make working
with ndarray very easy.

Arrays are very frequently used in data science, where speed and resources are very important.

Why is NumPy Faster Than Lists?


NumPy arrays are stored in one contiguous block of memory, unlike lists, whose elements are
separate objects that can be scattered across memory, so processes can access and manipulate
array data very efficiently.

This behavior is called locality of reference in computer science.

This is the main reason why NumPy is faster than lists. It is also optimized to work with the latest
CPU architectures.
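
The contiguous layout described above can be inspected directly. The following is a minimal sketch; flags, itemsize and strides are standard ndarray attributes, while the variable names and the int64 dtype are chosen here only for illustration.

```python
import numpy as np

# Fix the dtype so the element size is the same on every platform.
arr = np.array([1, 2, 3, 4, 5], dtype=np.int64)

# True when the elements sit next to each other in one block of memory.
print(arr.flags['C_CONTIGUOUS'])

# Each element occupies 8 bytes, and moving to the next element
# means stepping exactly 8 bytes forward in memory.
print(arr.itemsize)
print(arr.strides)

# Because the data is one homogeneous block, an operation can be applied
# to every element in compiled code, without a Python-level loop.
doubled = arr * 2
print(doubled)
```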

Which Language is NumPy written in?


NumPy is a Python library and is written partially in Python, but most of the parts that require
fast computation are written in C or C++.

import numpy

arr = numpy.array([1, 2, 3, 4, 5])

print(arr)

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.

We can create a NumPy ndarray object by using the array() function.

Example
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))

type(): This built-in Python function tells us the type of the object passed to it. In the code
above, it shows that arr is of type numpy.ndarray.

To create an ndarray, we can pass a list, tuple or any array-like object into the array() method,
and it will be converted into an ndarray:

Example

Use a tuple to create a NumPy array:

import numpy as np

arr = np.array((1, 2, 3, 4, 5))

print(arr)

Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).

Nested arrays are arrays that have arrays as their elements.

0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example

Create a 0-D array with value 42

import numpy as np

arr = np.array(42)

print(arr)

1-D Arrays
An array that has 0-D arrays as its elements is called uni-dimensional or 1-D array.

These are the most common and basic arrays.

Example

Create a 1-D array containing the values 1,2,3,4,5:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.

These are often used to represent matrices or 2nd-order tensors.

NumPy has a whole submodule dedicated to linear algebra operations on such arrays, called numpy.linalg.

Example

Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr)

3-D arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.

These are often used to represent a 3rd order tensor.

Example

Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and
4,5,6:

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr)

Check Number of Dimensions?


NumPy arrays provide the ndim attribute, which returns an integer that tells us how many
dimensions the array has.

Example

Check how many dimensions the arrays have:

import numpy as np

a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

Higher Dimensional Arrays


An array can have any number of dimensions.

When the array is created, you can define the number of dimensions by using the ndmin
argument.

Example

Create an array with 5 dimensions and verify that it has 5 dimensions:

import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('number of dimensions :', arr.ndim)

In this array the innermost dimension (5th dim) has 4 elements, the 4th dim has 1 element that is the
vector, the 3rd dim has 1 element that is the matrix containing the vector, the 2nd dim has 1 element
that is a 3-D array, and the 1st dim has 1 element that is a 4-D array.
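
The description above can be checked with the shape attribute, which lists the number of elements along each dimension. This is a small sketch; the variable name inner is chosen here for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)

# shape lists the number of elements along each dimension,
# from the outermost (1st) to the innermost (5th).
print(arr.shape)

# Selecting index 0 on each of the four outer dimensions
# reaches the original 4-element vector.
inner = arr[0, 0, 0, 0]
print(inner)
```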

NumPy Array Indexing

Access Array Elements


Array indexing is the same as accessing an array element.

You can access an array element by referring to its index number.

The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.

Example

Get the first element from the following array:


import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[0])

Example

Get the second element from the following array.

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[1])

Example

Get third and fourth elements from the following array and add them.

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[2] + arr[3])

Access 2-D Arrays


To access elements from 2-D arrays we can use comma separated integers representing the
dimension and the index of the element.

Example

Access the 2nd element on 1st dim:

import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('2nd element on 1st dim: ', arr[0, 1])


Example

Access the 5th element on 2nd dim:

import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('5th element on 2nd dim: ', arr[1, 4])

Access 3-D Arrays


To access elements from 3-D arrays we can use comma separated integers representing the
dimensions and the index of the element.

Example

Access the third element of the second array of the first array:

import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr[0, 1, 2])

Example Explained

arr[0, 1, 2] prints the value 6.

And this is why:

The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]

The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]

The third number represents the third dimension, which contains three values:
4
5
6
Since we selected 2, we end up with the third value:
6
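
The same walk-through can be reproduced one index at a time. In this sketch the intermediate names (first, second, third) are illustrative only.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

first = arr[0]       # first dimension: the first 2-D array
second = first[1]    # second dimension: the second row of that array
third = second[2]    # third dimension: the third value of that row

print(first)
print(second)
print(third)

# Indexing step by step reaches the same element as arr[0, 1, 2].
print(third == arr[0, 1, 2])
```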

Negative Indexing
Use negative indexing to access an array from the end.

Example

Print the last element from the 2nd dim:

import numpy as np

arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('Last element from 2nd dim: ', arr[1, -1])

Slicing arrays
Slicing in Python means taking elements from one given index to another given index.

We pass a slice instead of an index, like this: [start:end].

We can also define the step, like this: [start:end:step].

If we don't pass start, it is considered 0.

If we don't pass end, it is considered the length of the array in that dimension.

If we don't pass step, it is considered 1.

Example

Slice elements from index 1 to index 5 from the following array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5])

Note: The result includes the start index, but excludes the end index.

Example

Slice elements from index 4 to the end of the array:

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[4:])
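
The default values for start and end, and the optional step, can be seen in one short sketch. The variable names are chosen here for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# start omitted: the slice begins at index 0 and stops before index 4.
head = arr[:4]

# step of 2: takes the elements at indexes 1, 3 and 5.
every_other = arr[1:6:2]

# start and end omitted: every second element of the whole array.
all_by_two = arr[::2]

print(head)
print(every_other)
print(all_by_two)
```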

Pandas is a Python library.


Pandas is used to analyze data.
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).

CSV files contain plain text and are a well-known format that can be read by everyone, including
Pandas.

In our examples we will be using a CSV file called 'data.csv'.



Example

Load the CSV into a DataFrame:

import pandas as pd

df = pd.read_csv('desktop/data.csv')

print(df.to_string())

Tip: use to_string() to print the entire DataFrame.

By default, when you print a DataFrame that has more rows than the display limit (pd.options.display.max_rows, 60 by default), you will only get the first 5 rows and the last 5 rows:

Example

Print a reduced sample:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)
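
The difference between the two ways of printing can be tried without a CSV file. This is a sketch using a made-up 100-row DataFrame; the column name and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 100 rows, more than the default display limit.
df = pd.DataFrame({"value": range(100)})

# The row limit above which print(df) switches to the shortened view.
print(pd.options.display.max_rows)

short_text = repr(df)        # the text that print(df) shows
full_text = df.to_string()   # the text with every row included

# The shortened view has far fewer lines than the full one.
print(len(short_text.splitlines()))
print(len(full_text.splitlines()))
```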

Pandas Read JSON

Read JSON
Big data sets are often stored, or extracted as JSON.

JSON is plain text, but has the format of an object, and is well known in the world of
programming, including Pandas.

In our examples we will be using a JSON file called 'data.json'.


Example

Load the JSON file into a DataFrame:


import pandas as pd

df = pd.read_json('data.json')

print(df.to_string())

Tip: use to_string() to print the entire DataFrame.

Dictionary as JSON
JSON = Python Dictionary

JSON objects have the same format as Python dictionaries.

If your JSON code is not in a file, but in a Python Dictionary, you can load it into a DataFrame
directly:

Example

Load a Python Dictionary into a DataFrame:

import pandas as pd

data = {
    "Duration": {
        "0": 60,
        "1": 60,
        "2": 60,
        "3": 45,
        "4": 45,
        "5": 60
    },
    "Pulse": {
        "0": 110,
        "1": 117,
        "2": 103,
        "3": 109,
        "4": 117,
        "5": 102
    },
    "Maxpulse": {
        "0": 130,
        "1": 145,
        "2": 135,
        "3": 175,
        "4": 148,
        "5": 127
    },
    "Calories": {
        "0": 409,
        "1": 479,
        "2": 340,
        "3": 282,
        "4": 406,
        "5": 300
    }
}

df = pd.DataFrame(data)

print(df)
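
If the JSON is held as text rather than as a dictionary, it can first be converted with the standard json module. This is a minimal sketch; the JSON text below is made up for illustration and is much shorter than the data set used above.

```python
import json

import pandas as pd

# Hypothetical JSON text, for example received from a web service.
text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 117}}'

data = json.loads(text)   # JSON text -> Python dictionary
df = pd.DataFrame(data)   # outer keys become columns, inner keys become row labels

print(df)
```

Note that the row labels are the strings "0" and "1" taken from the inner keys, not integers.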

Pandas - Analyzing DataFrames

Viewing the Data


One of the most used methods for getting a quick overview of the DataFrame is the head()
method.

The head() method returns the headers and a specified number of rows, starting from the top.

Example

Get a quick overview by printing the first 10 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))


Note: if the number of rows is not specified, the head() method will return the top 5 rows.

Example

Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the
bottom.

Example

Print the last 5 rows of the DataFrame:

print(df.tail())

Info About the Data


The DataFrame object has a method called info() that gives you more information about the
data set.

Example

Print information about the data:

print(df.info())

Result

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 169 non-null int64
1 Pulse 169 non-null int64
2 Maxpulse 169 non-null int64
3 Calories 164 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None

Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.

This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.

Example

Return a new DataFrame with no empty cells:

import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())

In our cleaning examples we will be using a CSV file called 'dirtydata.csv'.


Note: By default, the dropna() method returns a new DataFrame, and will not change the
original.

If you want to change the original DataFrame, use the inplace = True argument:
Example

Remove all rows with NULL values:

import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string())

Note: Now, dropna(inplace = True) will NOT return a new DataFrame, but it will
remove all rows containing NULL values from the original DataFrame.
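
The two behaviors can be compared on a small DataFrame built in memory. This is a sketch with made-up values; the column names follow the data set used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell in the second row.
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.0, np.nan, 300.0],
})

# Default: a new DataFrame is returned and the original keeps all rows.
cleaned = df.dropna()
rows_before = len(df)
print(rows_before, len(cleaned))

# inplace=True: nothing is returned and the original itself loses the row.
result = df.dropna(inplace=True)
print(result)
print(len(df))
```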

Replace Empty Values


Another way of dealing with empty cells is to insert a new value instead.

This way you do not have to delete entire rows just because of some empty cells.

The fillna() method allows us to replace empty cells with a value:

Example

Replace NULL values with the number 130:

import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)
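
A call such as fillna(130) replaces the empty cells in every column. As a sketch of how this behaves, and of how the replacement can be limited to one column by passing a dictionary of column names to values, consider the made-up data below. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell in each column.
df = pd.DataFrame({
    "Pulse": [110.0, np.nan, 103.0],
    "Calories": [409.0, 479.0, np.nan],
})

# A single value fills the empty cells in every column.
filled_all = df.fillna(130)

# A dictionary fills only the named columns; Pulse keeps its empty cell.
filled_one = df.fillna({"Calories": 130})

print(filled_all)
print(filled_one)
```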

What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.

Matplotlib was created by John D. Hunter.

Matplotlib is open source and we can use it freely.


Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and
JavaScript for platform compatibility.

import matplotlib.pyplot as plt

Now the pyplot submodule can be referred to as plt.

Example

Draw a line in a diagram from position (0,0) to position (6,250):

import matplotlib.pyplot as plt


import numpy as np

xpoints = np.array([0, 6])


ypoints = np.array([0, 250])

plt.plot(xpoints, ypoints)
plt.show()

Linestyle
You can use the keyword argument linestyle, or shorter ls, to change the style of the plotted
line:

Example

Use a dotted line:

import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, linestyle = 'dotted')


plt.show()

Creating Bars
With Pyplot, you can use the bar() function to draw bar graphs:
Example

Draw 4 bars:

import matplotlib.pyplot as plt


import numpy as np

x = np.array(["A", "B", "C", "D"])


y = np.array([3, 8, 1, 10])

plt.bar(x,y)
plt.show()

The bar() function takes arguments that describe the layout of the bars.

The categories and their values are represented by the first and second arguments, as arrays.

Example
import matplotlib.pyplot as plt

x = ["APPLES", "BANANAS"]
y = [400, 350]
plt.bar(x, y)
plt.show()

Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically, use the barh() function:

Example

Draw 4 horizontal bars:

import matplotlib.pyplot as plt


import numpy as np

x = np.array(["A", "B", "C", "D"])


y = np.array([3, 8, 1, 10])

plt.barh(x, y)
plt.show()

Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot.

The scatter() function plots one dot for each observation. It needs two arrays of the same
length, one for the values of the x-axis, and one for values on the y-axis:

Example

A simple scatter plot:

import matplotlib.pyplot as plt


import numpy as np

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])

plt.scatter(x, y)
plt.show()

Colors
You can set your own color for each scatter plot with the color or the c argument:

Example

Set your own color of the markers:

import matplotlib.pyplot as plt


import numpy as np

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')

x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')

plt.show()
Matplotlib Histograms

Histogram
A histogram is a graph showing frequency distributions.

It is a graph showing the number of observations within each given interval.

Example: Say you ask for the height of 250 people, you might end up with a histogram like this:

You can read from the histogram that there are approximately:

2 people from 140 to 145cm


5 people from 145 to 150cm
15 people from 151 to 156cm
31 people from 157 to 162cm
46 people from 163 to 168cm
53 people from 168 to 173cm
45 people from 173 to 178cm
28 people from 179 to 184cm
21 people from 185 to 190cm
4 people from 190 to 195cm

Create Histogram
In Matplotlib, we use the hist() function to create histograms.

The hist() function will use an array of numbers to create a histogram, the array is sent into the
function as an argument.

For simplicity we use NumPy to randomly generate an array with 250 values drawn from a normal
distribution, where the values concentrate around 170 (the mean) and the standard deviation is 10.

Example

A Normal Data Distribution by NumPy:


import numpy as np

x = np.random.normal(170, 10, 250)

print(x)

The hist() function will read the array and produce a histogram:

import matplotlib.pyplot as plt


import numpy as np

x = np.random.normal(170, 10, 250)

plt.hist(x)
plt.show()
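
The counts behind the bars can also be inspected as numbers with NumPy's histogram() function. This is a sketch rather than what hist() prints: it assumes 10 bins (the default number that hist() uses) and a seeded random generator so that repeated runs give the same data; the seed value is arbitrary.

```python
import numpy as np

# Seeded generator so the example is reproducible; the seed is arbitrary.
rng = np.random.default_rng(seed=0)
x = rng.normal(170, 10, 250)

# counts holds the number of values in each bin,
# edges holds the 11 boundaries of the 10 bins.
counts, edges = np.histogram(x, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} cm: {count} people")
```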

Plotting x and y points


The plot() function is used to draw points (markers) in a diagram.

By default, the plot() function draws a line from point to point.

The function takes parameters for specifying points in the diagram.

Parameter 1 is an array containing the points on the x-axis.

Parameter 2 is an array containing the points on the y-axis.

If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3, 10] to the
plot function.

Example

Draw a line in a diagram from position (1, 3) to position (8, 10):

import matplotlib.pyplot as plt


import numpy as np

xpoints = np.array([1, 8])


ypoints = np.array([3, 10])

plt.plot(xpoints, ypoints)
plt.show()
Markers
You can use the keyword argument marker to emphasize each point with a specified marker:

Example

Mark each point with a circle:

import matplotlib.pyplot as plt


import numpy as np

ypoints = np.array([3, 8, 1, 10])

plt.plot(ypoints, marker = 'o')


plt.show()

Add Grid Lines to a Plot


With Pyplot, you can use the grid() function to add grid lines to the plot.

Example

Add grid lines to the plot:

import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])

plt.title("Sports Watch Data")


plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")

plt.plot(x, y)

plt.grid()

plt.show()

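You can use the axis parameter of the grid() function to specify which grid lines to display. For example, axis = 'y' displays only the grid lines for the y-axis: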

import numpy as np
import matplotlib.pyplot as plt

x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])

plt.title("Sports Watch Data")


plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")

plt.plot(x, y)

plt.grid(axis = 'y')

plt.show()

Matplotlib Pie Charts

Labels
Add labels to the pie chart with the labels parameter.

The labels parameter must be an array with one label for each wedge:
Example

A simple pie chart:

import matplotlib.pyplot as plt


import numpy as np

y = np.array([35, 25, 25, 15])


mylabels = ["Apples", "Bananas", "Cherries", "Dates"]

plt.pie(y, labels = mylabels)


plt.show()

Python – seaborn.pairplot() method

Last Updated: 15 Jul, 2020

Prerequisite: Seaborn Programming Basics

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics. Seaborn helps resolve
two major problems faced by Matplotlib users:

• Default Matplotlib parameters
• Working with data frames

As Seaborn complements and extends Matplotlib, the learning curve is quite gradual. If you
know Matplotlib, you are already halfway through Seaborn.

seaborn.pairplot() :

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function.
It shows the relationship for each pair of variables in a DataFrame, that is, every one of the
C(n, 2) combinations of n variables, as a matrix of plots. The plots on the diagonal are univariate plots.

seaborn.pairplot(data, **kwargs)

seaborn.pairplot() accepts many arguments; the main ones are described in the table
below:

Argument | Description | Value
data | Tidy (long-form) dataframe where each column is a variable and each row is an observation. | DataFrame
hue | Variable in "data" to map plot aspects to different colors. | string (variable name), optional
palette | Set of colors for mapping the "hue" variable. If a dict, keys should be values in the "hue" variable. | dict or seaborn color palette
vars | Variables within "data" to use; by default every column with a numeric datatype is used. | list of variable names, optional
{x, y}_vars | Variables within "data" to use separately for the rows and columns of the figure; i.e. to make a non-square plot. | lists of variable names, optional
dropna | Drop missing values from the data before plotting. | boolean, optional

The examples below show how to use the pairplot() function:

Example 1:
# importing packages

import seaborn

import matplotlib.pyplot as plt

############# Main Section ############

# loading dataset using seaborn

df = seaborn.load_dataset('tips')

# pairplot with hue sex

seaborn.pairplot(df, hue ='sex')

# to show

plt.show()


Example 2:

# importing packages
import seaborn

import matplotlib.pyplot as plt

############# Main Section ############

# loading dataset using seaborn

df = seaborn.load_dataset('tips')

# pairplot with hue day

seaborn.pairplot(df, hue ='day')

# to show

plt.show()


A box plot, also known as a box-and-whisker plot, displays a summary of a set of data
values using five properties: minimum, first quartile, median, third quartile and maximum. In
the box plot, a box is drawn from the first quartile to the third quartile, and a line goes
through the box at the median. In Matplotlib's default vertical orientation, the x-axis identifies
the data set being plotted while the y-axis shows the data values.
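
The five values that a box plot summarizes can be computed with NumPy. This is a small sketch on made-up data; np.percentile with its default (linear) interpolation is used here, which is one of several common ways to define quartiles.

```python
import numpy as np

# Hypothetical data set: the integers 1 to 9.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First quartile, median and third quartile.
q1, median, q3 = np.percentile(data, [25, 50, 75])

print("minimum:", data.min())
print("first quartile:", q1)
print("median:", median)
print("third quartile:", q3)
print("maximum:", data.max())
```

Note that Matplotlib's boxplot() by default draws the whiskers only to the most extreme points within 1.5 times the interquartile range of the box and plots anything beyond as individual outlier points, so the whisker ends coincide with the minimum and maximum only when no point lies that far out.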

Creating Box Plot


The matplotlib.pyplot module of the Matplotlib library provides the boxplot() function, with
which we can create box plots.

Syntax:

matplotlib.pyplot.boxplot(data, notch=None, vert=None, patch_artist=None, widths=None)

Parameters:
Attribute | Value
data | array or sequence of arrays to be plotted
notch | optional parameter, accepts boolean values
vert | optional parameter, accepts boolean values; false and true give a horizontal and a vertical plot respectively
bootstrap | optional parameter, accepts an int; controls bootstrapping of the confidence intervals around the median of notched box plots
usermedians | optional parameter, accepts an array or sequence of arrays with dimensions compatible with data
positions | optional parameter, accepts an array and sets the positions of the boxes
widths | optional parameter, accepts an array and sets the widths of the boxes
patch_artist | optional parameter, accepts boolean values
labels | sequence of strings, sets the label for each data set
meanline | optional parameter, accepts a boolean value; if true, tries to render the mean as a line across the full width of the box
zorder | optional parameter, sets the zorder of the box plot

The data values given to the boxplot() function can be a NumPy array, a Python list, or a
tuple of arrays. Let us create a box plot using numpy.random.normal() to generate some
random data; it takes the mean, the standard deviation, and the desired number of values as arguments.

import matplotlib.pyplot as plt

import numpy as np

# Creating dataset

np.random.seed(10) #The seed() method is used to initialize the random number generator.

data = np.random.normal(100, 20, 200)

fig = plt.figure(figsize =(10, 7))

# Creating plot

plt.boxplot(data)

# show plot

plt.show()

numpy.random.normal(loc, scale, size): creates an array of the specified shape and fills it
with random values drawn from a normal (Gaussian) distribution with mean loc and standard
deviation scale. This distribution is also known as the bell curve because of its characteristic shape.

Customizing Box Plot


The matplotlib.pyplot.boxplot() function provides many ways to customize the box
plot. The notch = True attribute gives the boxes a notched format, and patch_artist =
True fills the boxes with color, so that we can set different colors for different boxes. The vert = 0
attribute creates a horizontal box plot. labels takes one label for each data set.

Example 1:

# Import libraries
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
np.random.seed(10)

data_1 = np.random.normal(100, 10, 200)


data_2 = np.random.normal(90, 20, 200)
data_3 = np.random.normal(80, 30, 200)
data_4 = np.random.normal(70, 40, 200)
data = [data_1, data_2, data_3, data_4]

fig = plt.figure(figsize =(10, 7))

# Creating axes instance


ax = fig.add_axes([0, 0, 1, 1])

# Creating plot
bp = ax.boxplot(data)

# show plot
plt.show()
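
The example above draws the boxes with default settings. As a sketch of two of the customizations described in this section, the block below uses notch = True and patch_artist = True and gives each box its own fill color. The color names and the use of plt.subplots() are choices made here for illustration, not part of the original example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
np.random.seed(10)
data = [
    np.random.normal(100, 10, 200),
    np.random.normal(90, 20, 200),
    np.random.normal(80, 30, 200),
]

fig, ax = plt.subplots(figsize=(10, 7))

# notch=True draws notched boxes; patch_artist=True makes each box a
# filled patch, so that it can be given its own color afterwards.
bp = ax.boxplot(data, notch=True, patch_artist=True)

# Illustrative colors, one per box.
colors = ["lightblue", "lightgreen", "pink"]
for box, color in zip(bp["boxes"], colors):
    box.set_facecolor(color)

# show plot
plt.show()
```

boxplot() returns a dictionary of the drawn parts (for example "boxes", "medians" and "whiskers"), which is what makes it possible to style the boxes after the plot has been created.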
