Numpy and Pandas
NumPy is a Python library used for working with arrays. It also has functions for working in
the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by
Travis Oliphant. It is an open source project and you can use it freely. NumPy stands for
Numerical Python.
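Arrays are used very frequently in data science, where speed and memory use matter. Unlike
Python lists, NumPy arrays are stored in one contiguous block of memory, so their elements can
be accessed and processed very efficiently. This is the main reason why NumPy is faster than
lists. NumPy is also optimized to work with the latest CPU architectures.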
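As a small illustration of the storage point above, the following sketch builds one array and inspects how it is laid out. The array size and the variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# One million 64-bit integers held in a single contiguous buffer.
arr = np.arange(1_000_000, dtype=np.int64)

print(arr.flags['C_CONTIGUOUS'])  # True: elements sit next to each other in memory
print(arr.itemsize)               # 8 bytes per element
print(arr.nbytes)                 # 8000000 bytes in total

# Arithmetic is applied to the whole array at once, without a Python-level loop.
doubled = arr * 2

# The list version needs an explicit loop (here a comprehension).
doubled_list = [value * 2 for value in range(1_000_000)]

print(doubled[:5])
print(doubled_list[:5])
```

Both versions produce the same numbers; the difference is that the array version hands the whole loop to compiled code.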
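import numpy
arr = numpy.array([1, 2, 3, 4, 5])  # illustrative values
print(arr)
NumPy is usually imported under the alias np:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)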
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
type(): This built-in Python function returns the type of the object passed to it. In the code
above, it shows that arr is of type numpy.ndarray.
To create an ndarray, we can pass a list, tuple or any array-like object into the array()
function, and it will be converted into an ndarray:
Example
import numpy as np
arr = np.array((1, 2, 3, 4, 5))  # a tuple works as well as a list
print(arr)
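To make the "array-like" wording concrete, the sketch below passes three different kinds of input to np.array(). The variable names are illustrative.

```python
import numpy as np

from_list = np.array([1, 2, 3, 4, 5])    # built from a list
from_tuple = np.array((1, 2, 3, 4, 5))   # built from a tuple
from_range = np.array(range(1, 6))       # built from a range object

# All three inputs are converted into the same kind of object.
for arr in (from_list, from_tuple, from_range):
    print(type(arr), arr)
```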
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
0-D Arrays
0-D arrays, or Scalars, are the elements in an array. Each value in an array is a 0-D array.
Example
import numpy as np
arr = np.array(42)
print(arr)
1-D Arrays
An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
NumPy also provides a dedicated matrix class, numpy.matrix, although the NumPy documentation
recommends regular 2-D arrays for new code.
Example
Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
3-D Arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
Example
Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and
4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
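NumPy arrays provide the ndim attribute, which returns an integer giving the number of
dimensions of the array.
Example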
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
When the array is created, you can define the number of dimensions by using the ndmin
argument.
Example
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)
In this array the innermost dimension (5th dim) has 4 elements, the 4th dim has 1 element that is the
vector, the 3rd dim has 1 element that is the matrix with the vector, the 2nd dim has 1 element that is
3D array and 1st dim has 1 element that is a 4D array.
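The description above can be checked with the shape attribute, which lists the number of elements along each dimension. This is a minimal sketch using the same four values.

```python
import numpy as np

arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)        # [[[[[1 2 3 4]]]]]
print(arr.ndim)   # 5
print(arr.shape)  # (1, 1, 1, 1, 4): four outer dimensions of size 1, innermost of size 4
```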
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the
second has index 1 etc.
Example
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
Example
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1])
Example
Get third and fourth elements from the following array and add them.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
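Example
Access the second element of the first row of a 2-D array (illustrative values):
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr[0, 1])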
Example
Access the third element of the second array of the first array:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
Example Explained
The first number represents the first dimension, which contains two arrays:
[[1, 2, 3], [4, 5, 6]]
and:
[[7, 8, 9], [10, 11, 12]]
Since we selected 0, we are left with the first array:
[[1, 2, 3], [4, 5, 6]]
The second number represents the second dimension, which also contains two arrays:
[1, 2, 3]
and:
[4, 5, 6]
Since we selected 1, we are left with the second array:
[4, 5, 6]
The third number represents the third dimension, which contains three values:
4
5
6
Since we selected 2, we end up with the third value:
6
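The walkthrough above can be reproduced one index at a time. The intermediate names first, second and third are illustrative.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

first = arr[0]      # first dimension:  [[1 2 3] [4 5 6]]
second = first[1]   # second dimension: [4 5 6]
third = second[2]   # third dimension:  6

print(first)
print(second)
print(third)
print(third == arr[0, 1, 2])  # True: same element as the combined index
```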
Negative Indexing
Use negative indexing to access an array from the end.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5])  # illustrative values
print(arr[-1])
Slicing arrays
Slicing in Python means taking elements from one given index to another given index.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5])
Note: The result includes the start index, but excludes the end index.
Example
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
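The sketch below collects a few slice forms in one place, including a negative range and a step, which the examples above do not show. The array values are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

print(arr[1:5])    # [2 3 4 5]  start included, end excluded
print(arr[4:])     # [5 6 7]    from index 4 to the end
print(arr[:4])     # [1 2 3 4]  from the beginning up to (not including) index 4
print(arr[-3:-1])  # [5 6]      negative indexes count from the end
print(arr[1:5:2])  # [2 4]      every second element between index 1 and 5
```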
Read CSV Files
CSV (comma-separated values) files contain plain text and are a well-known format that can be
read by almost any tool, including Pandas.
import pandas as pd
df = pd.read_csv('desktop/data.csv')
print(df.to_string())
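If no CSV file is at hand, the same call can be tried on in-memory text. The sketch below wraps a small CSV string in io.StringIO, which read_csv() accepts in place of a file path; the column names follow the examples in these notes and the values are illustrative.

```python
import io
import pandas as pd

# A tiny CSV document; the last row has an empty Calories cell.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.to_string())  # prints every row
print(df.dtypes)       # Calories is read as float because of the empty cell
```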
By default, when you print a DataFrame with more rows than the display.max_rows option (60 by
default), you will only get the first 5 rows and the last 5 rows:
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
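The difference between the truncated view and to_string() can be seen without a file. This sketch assumes the pandas display options are at their defaults; the 100-row DataFrame is made up for the purpose.

```python
import pandas as pd

# 100 rows is more than the default display.max_rows, so printing truncates.
df = pd.DataFrame({"value": range(100)})

short_view = repr(df)         # what print(df) shows
full_view = df.to_string()    # every row, regardless of display options

print(pd.get_option("display.max_rows"))
print(short_view)
print(len(full_view.splitlines()))  # 101: one header line plus 100 rows
```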
Read JSON
Big data sets are often stored or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the world of
programming, including Pandas.
The example below uses a JSON file called data.json.
Example
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
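As with CSV, read_json() also accepts a file-like object, so the call can be tried on a JSON string wrapped in io.StringIO. The JSON text below is an illustrative fragment in the same column-oriented layout used elsewhere in these notes.

```python
import io
import pandas as pd

# Column name -> {row label -> value}
json_text = (
    '{"Duration": {"0": 60, "1": 60, "2": 45},'
    ' "Pulse": {"0": 110, "1": 117, "2": 109}}'
)

df = pd.read_json(io.StringIO(json_text))
print(df.to_string())
```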
Dictionary as JSON
JSON objects have the same format as Python dictionaries.
If your JSON data is not in a file, but in a Python dictionary, you can load it into a DataFrame
directly:
Example
import pandas as pd
data = {
    "Duration": {"0": 60, "1": 60, "2": 60, "3": 45, "4": 45, "5": 60},
    "Pulse": {"0": 110, "1": 117, "2": 103, "3": 109, "4": 117, "5": 102},
    "Maxpulse": {"0": 130, "1": 145, "2": 135, "3": 175, "4": 148, "5": 127},
    "Calories": {"0": 409, "1": 479, "2": 340, "3": 282, "4": 406, "5": 300},
}
df = pd.DataFrame(data)
print(df)
The head() method returns the headers and a specified number of rows, starting from the top.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Note: if the number of rows is not specified, the head() method will return the top 5 rows.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the
bottom.
Example
print(df.tail())
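A quick way to see what head() and tail() return is to run them on a small DataFrame built in code. The eight rows below are illustrative values, not the contents of data.csv.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 60, 60, 45],
    "Pulse": [110, 117, 103, 109, 117, 102, 110, 104],
})

top = df.head(3)      # first three rows
bottom = df.tail(3)   # last three rows

print(top)
print(bottom)
print(len(df.head()))  # 5: the default when no number is given
```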
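The info() method prints a summary of the DataFrame: the index range, the non-null count and
data type of each column, and the memory usage.
Example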
print(df.info())
Result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
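None
The final None is the return value of info(), which print() displays; calling df.info() on its
own avoids it. The summary shows that the Calories column has 164 non-null values out of 169
rows, so 5 rows have no Calories value.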
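The non-null counts reported by info() can also be turned into a direct count of empty cells. The sketch below uses isna().sum() on a small illustrative DataFrame rather than on data.csv.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45, 45],
    "Calories": [409.1, np.nan, 340.0, np.nan],
})

# isna() marks empty cells as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Same number, derived the way info() reports it: total rows minus non-null count.
print(len(df) - df["Calories"].count())  # 2
```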
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
Example
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Note: By default, the dropna() method returns a new DataFrame, and will not change the
original.
If you want to change the original DataFrame, use the inplace = True argument:
Example
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Note: With inplace = True, dropna() does NOT return a new DataFrame; it removes all rows
containing NULL values from the original DataFrame.
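The two behaviours of dropna() can be compared side by side. This is a minimal sketch on a three-row DataFrame with made-up values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [409.1, np.nan, 340.0],
})

# Default: a cleaned copy is returned and df keeps all three rows.
new_df = df.dropna()
print(len(df), len(new_df))  # 3 2

# inplace=True: df itself is changed and the call returns None.
result = df.dropna(inplace=True)
print(result)   # None
print(len(df))  # 2
```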
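Another way to deal with empty cells is to insert a replacement value with the fillna() method.
This way you do not have to delete entire rows just because of some empty cells.
Example
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)  # 130 is an illustrative replacement value
print(df.to_string())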
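One common choice of replacement value is the mean of the column. The sketch below fills a single column that way using fillna(); the DataFrame and its values are illustrative, and using the mean is an assumption about what suits the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pulse": [110, 117, 103],
    "Calories": [400.0, np.nan, 300.0],
})

# mean() skips empty cells: (400 + 300) / 2 = 350.0
mean_calories = df["Calories"].mean()

# Replace empty cells in this one column only; no rows are removed.
df["Calories"] = df["Calories"].fillna(mean_calories)
print(df)
```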
What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
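Example
import matplotlib.pyplot as plt
xpoints = [0, 6]    # illustrative x-values
ypoints = [0, 250]  # illustrative y-values
plt.plot(xpoints, ypoints)
plt.show()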
Linestyle
You can use the keyword argument linestyle, or shorter ls, to change the style of the plotted
line:
Example
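A minimal sketch of the linestyle argument follows. The y-values are illustrative, and the second line is shifted up only so that both styles are visible in the same figure.

```python
import matplotlib.pyplot as plt

ypoints = [3, 8, 1, 10]                   # illustrative values
shifted = [value + 2 for value in ypoints]

# Long form of the keyword argument.
(dotted_line,) = plt.plot(ypoints, linestyle="dotted")

# Short form: ls is an alias for linestyle.
(dashed_line,) = plt.plot(shifted, ls="dashed")

print(len(plt.gca().lines))  # 2 lines on the current axes

# Call plt.show() to display the figure.
```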
Creating Bars
With Pyplot, you can use the bar() function to draw bar graphs:
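Example
Draw 4 bars:
import matplotlib.pyplot as plt
x = ["A", "B", "C", "D"]  # illustrative categories
y = [3, 8, 1, 10]         # illustrative values
plt.bar(x,y)
plt.show()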
The bar() function takes arguments that describe the layout of the bars.
The categories and their values are given by the first and second arguments, as arrays.
Example
x = ["APPLES", "BANANAS"]
y = [400, 350]
plt.bar(x, y)
plt.show()
Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically, use the barh() function:
Example
plt.barh(x, y)
plt.show()
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of the same
length, one for the values of the x-axis, and one for values on the y-axis:
Example
import numpy as np
import matplotlib.pyplot as plt
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()
Colors
You can set your own color for each scatter plot with the color or the c argument:
Example
import numpy as np
import matplotlib.pyplot as plt
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')
plt.show()
Matplotlib Histograms
Histogram
A histogram is a graph showing frequency distributions.
Example: if you record the height of 250 people, a histogram of the results shows approximately
how many people fall within each height range.
Create Histogram
In Matplotlib, we use the hist() function to create histograms.
The hist() function takes an array of numbers as its argument and uses it to create the
histogram.
For simplicity we use NumPy to randomly generate an array with 250 values that concentrate
around 170 with a standard deviation of 10 (a normal distribution).
Example
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(170, 10, 250)
print(x)
The hist() function will read the array and produce a histogram:
plt.hist(x)
plt.show()
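The numbers behind the bars can be inspected without drawing anything. The sketch below uses np.histogram() with 10 bins, which matches the default bin count of hist(); the seeded generator is an assumption added so that repeated runs give the same data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the run is repeatable
x = rng.normal(170, 10, 250)          # 250 values around 170, standard deviation 10

# counts[i] is the number of values between bin_edges[i] and bin_edges[i + 1].
counts, bin_edges = np.histogram(x, bins=10)

for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:6.1f} to {right:6.1f}: {count}")

print(counts.sum())  # 250: every value lands in exactly one bin
```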
If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3, 10] to the
plot function.
Example
import matplotlib.pyplot as plt
xpoints = [1, 8]
ypoints = [3, 10]
plt.plot(xpoints, ypoints)
plt.show()
Markers
You can use the keyword argument marker to emphasize each point with a specified marker:
Example
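A minimal sketch of the marker argument follows. The y-values are illustrative, and 'o' (a circle) is just one of the available marker codes.

```python
import matplotlib.pyplot as plt

ypoints = [3, 8, 1, 10]  # illustrative values

# Each of the four points is drawn with a circle on top of the line.
(line,) = plt.plot(ypoints, marker="o")

print(line.get_marker())  # o

# Call plt.show() to display the figure.
```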
Add Grid Lines
With Pyplot, you can use the grid() function to add grid lines to the plot.
Example
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.grid()
plt.show()
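You can use the axis parameter of grid() to choose which grid lines to display; legal values are
'x', 'y', and 'both' (the default).
Example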
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.grid(axis = 'y')
plt.show()
Labels
With Pyplot, you can use the pie() function to draw pie charts. Add labels to the pie chart with
the labels parameter.
The labels parameter must be an array with one label for each wedge:
Example
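A minimal sketch of a labelled pie chart follows. The wedge sizes and label names are illustrative; note that the keyword argument of pie() is spelled labels.

```python
import matplotlib.pyplot as plt

y = [35, 25, 25, 15]                                    # wedge sizes
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]   # one label per wedge

plt.pie(y, labels=mylabels)

# The labels end up as text objects on the current axes.
label_texts = [text.get_text() for text in plt.gca().texts]
print(label_texts)

# Call plt.show() to display the figure.
```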
As Seaborn complements and extends Matplotlib, the learning curve is quite gradual. If you
know Matplotlib, you are already halfway through Seaborn.
seaborn.pairplot():
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot()
function. It shows the relationship for every pair of numeric variables in a DataFrame (n choose
2 combinations) as a matrix of plots; the diagonal plots are the univariate distributions.
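To see how many panels this produces, the pairs can be enumerated directly. The sketch below uses the three numeric columns of the tips example dataset; it only counts combinations and does not call Seaborn.

```python
from itertools import combinations

# Numeric columns of the tips dataset.
numeric_columns = ["total_bill", "tip", "size"]
n = len(numeric_columns)

# Every unordered pair of variables: n choose 2.
pairs = list(combinations(numeric_columns, 2))
print(pairs)
print(len(pairs))  # 3

# The grid is n x n: n diagonal panels plus each pair shown twice (mirrored).
print(n * n, n + 2 * len(pairs))  # 9 9
```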
seaborn.pairplot() accepts many arguments; the main ones are described below:
hue (string, variable name; optional): Variable in "data" to map plot aspects to different colors.
palette (dict or seaborn color palette): Set of colors for mapping the "hue" variable. If a dict,
keys should be values in the "hue" variable.
vars (list of variable names; optional): Variables within "data" to use; otherwise every column
with a numeric datatype is used.
{x, y}_vars (lists of variable names; optional): Variables within "data" to use separately for
the rows and columns of the figure, i.e. to make a non-square plot.
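Example 1:
# importing packages
import seaborn
import matplotlib.pyplot as plt
# loading the example dataset (downloaded on first use)
df = seaborn.load_dataset('tips')
# pairwise plots of the numeric columns
seaborn.pairplot(df)
# to show
plt.show()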
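Example 2:
# importing packages
import seaborn
import matplotlib.pyplot as plt
df = seaborn.load_dataset('tips')
# color the points by a categorical column; 'day' is an illustrative choice
seaborn.pairplot(df, hue='day')
# to show
plt.show()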
A box plot, also known as a box-and-whisker plot, displays a summary of a set of data values:
the minimum, first quartile, median, third quartile, and maximum. In the box plot, a box is drawn
from the first quartile to the third quartile, and a line is drawn across the box at the median.
In the default vertical orientation, the x-axis identifies each dataset being plotted and the
y-axis shows the data values.
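The five values that a box plot summarizes can be computed directly with NumPy. This is a sketch on ten illustrative numbers; np.percentile() uses linear interpolation between data points by default.

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])  # illustrative values

minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()

print(minimum, q1, median, q3, maximum)  # 7 36.75 40.5 42.75 49
print(q3 - q1)                           # 6.0, the interquartile range (height of the box)
```

Note that Matplotlib, by default, draws the whiskers only as far as the most extreme points within 1.5 times the interquartile range of the box and shows anything beyond that as individual points, so the whisker ends are not always the minimum and maximum.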
Syntax:
matplotlib.pyplot.boxplot(data, notch=None, vert=None, patch_artist=None, widths=None)
Parameters:
Attribute      Value
data           array or sequence of arrays to be plotted
notch          optional parameter; accepts boolean values
vert           optional parameter; accepts boolean values, false for a horizontal and true for a vertical plot
bootstrap      optional parameter; accepts an int that specifies intervals around notched boxplots
usermedians    optional parameter; accepts an array or sequence of arrays with dimensions compatible with data
positions      optional parameter; accepts an array and sets the position of the boxes
widths         optional parameter; accepts an array and sets the width of the boxes
patch_artist   optional parameter; accepts boolean values
labels         sequence of strings; sets the label for each dataset
meanline       optional parameter; boolean value, tries to render the mean line as the full width of the box
zorder         optional parameter; sets the zorder of the boxplot
The data values given to the ax.boxplot() method can be a NumPy array, a Python list, or a
tuple of arrays. Let us create a box plot using numpy.random.normal() to generate some random
data; it takes the mean, the standard deviation, and the desired number of values as arguments.
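import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
np.random.seed(10)  # seed() initializes the random number generator so results are repeatable
data = np.random.normal(100, 20, 200)  # illustrative mean, standard deviation, and size
# Creating plot
plt.boxplot(data)
# show plot
plt.show()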
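Example 1:
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset: four samples with illustrative means and standard deviations
np.random.seed(10)
data_1 = np.random.normal(100, 10, 200)
data_2 = np.random.normal(90, 20, 200)
data_3 = np.random.normal(80, 30, 200)
data_4 = np.random.normal(70, 40, 200)
data = [data_1, data_2, data_3, data_4]
# Creating figure and axes
fig, ax = plt.subplots(figsize=(10, 7))
# Creating plot
bp = ax.boxplot(data)
# show plot
plt.show()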