NumPy, Pandas and Matplotlib
NumPy is a Python library for numerical and mathematical operations. It provides
support for large, multi-dimensional arrays and matrices, along with a
collection of high-level
mathematical functions to operate on these arrays. NumPy is a fundamental
package for scientific computing in Python and is widely used in various fields
such as data science, machine learning, signal processing, and more.
NumPy is a Python library used for working with arrays. It also has functions for
working in the domain of linear algebra, Fourier transforms, and matrices. NumPy
was created in 2005 by Travis Oliphant.
Installation:
You can install NumPy using the following command if you haven't already:
pip install numpy
Importing NumPy:
Once installed, you can import NumPy in your Python script or Jupyter
notebook:
import numpy as np
The np alias is a common convention used for NumPy.
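Check how many dimensions the arrays have (a to d are illustrative 0-D to 3-D arrays):
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)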
Higher Dimensional Arrays
An array can have any number of dimensions.
When the array is created, you can define the number of dimensions by using
the ndmin argument.
Array Operations:
NumPy supports element-wise operations on arrays:
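# Example arrays (illustrative values)
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Element-wise addition
result = arr1 + 10
# Element-wise multiplication
result = arr2 * 2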
# Accessing elements
element = arr1[2]
#Get third and fourth elements from the following array and add them.
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Negative Indexing
Use negative indexing to access an array from the end.
#Print the last element from the 2nd dim:
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])
NumPy Array Slicing
Slicing arrays
Slicing in Python means taking elements from one given index to another given
index.
We pass a slice instead of an index like this: [start:end].
We can also define the step, like this: [start:end:step].
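The three slice forms can be compared side by side. This is a minimal sketch on an illustrative array; the values are assumptions chosen only to make the selected positions easy to follow.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# [start:end] takes indices 1, 2, 3, 4 (the end index 5 is excluded)
middle = arr[1:5]
print(middle)          # [2 3 4 5]

# [start:end:step] takes every second element of that range
stepped = arr[1:5:2]
print(stepped)         # [2 4]

# Omitting start and end applies the step to the whole array
every_other = arr[::2]
print(every_other)     # [1 3 5 7]
```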
#Slice from the index 3 from the end to index 1 from the end:
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])
#From both elements, slice index 1 to index 4 (not included), this will
#return a 2-D array:
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])
# Slicing
subarray = arr2[:2, 1:3]
Below is a list of all data types in NumPy and the characters used to represent
them.
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type ( void )
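As a small illustration of how these characters show up in practice, the sketch below checks the type NumPy infers and then requests a type explicitly through the dtype argument. The array values are illustrative assumptions.

```python
import numpy as np

# NumPy infers the data type from the values
ints = np.array([1, 2, 3])
print(ints.dtype.kind)       # i

# The dtype argument requests a type explicitly, here a byte string
texts = np.array([1, 2, 3], dtype='S')
print(texts.dtype.kind)      # S

# A size in bytes can follow the character, e.g. 'f4' for a 4-byte float
floats = np.array([1, 2, 3], dtype='f4')
print(floats.dtype)          # float32
```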
Change data type from float to integer by using 'i' as parameter value:
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)
#Change data type from float to integer by using int as parameter value:
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype(int)
print(newarr)
print(newarr.dtype)
Create an array with 5 dimensions using ndmin using a vector with values
1,2,3,4 and verify that last dimension has value 4:
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)
Search Sorted
There is a method called searchsorted() which performs a binary search in the
array, and returns the index where the specified value would be inserted to
maintain the sort order.
The searchsorted() method assumes that the array is already sorted.
Find the indexes where the value 7 should be inserted:
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7)
print(x)
Example explained: The number 7 should be inserted on index 1 to remain the
sort order.
The method starts the search from the left and returns the first index where the
number 7 is no longer larger than the next value.
Find the indexes where the value 7 should be inserted, starting from the right:
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7, side='right')
print(x)
Example explained: The number 7 should be inserted on index 2 to remain the
sort order.
The method starts the search from the right and returns the first index where the
number 7 is no longer less than the next value.
Multiple Values
To search for more than one value, use an array with the specified values.
Find the indexes where the values 2, 4, and 6 should be inserted:
arr = np.array([1, 3, 5, 7])
x = np.searchsorted(arr, [2, 4, 6])
print(x)
The return value is an array: [1 2 3] containing the three indexes where 2, 4, 6
would be inserted in the original array to maintain the order.
Create a filter array that will return only values higher than 42:
arr = np.array([41, 42, 43, 44])
# Create an empty list
filter_arr = []
# go through each element in arr
for element in arr:
    # if the element is higher than 42, set the value to True, otherwise False:
    if element > 42:
        filter_arr.append(True)
    else:
        filter_arr.append(False)
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Create a filter array that will return only even elements from the original
array:
arr = np.array([1, 2, 3, 4, 5, 6, 7])
# Create an empty list
filter_arr = []
# go through each element in arr
for element in arr:
    # if the element is completely divisible by 2, set the value to True,
    # otherwise False
    if element % 2 == 0:
        filter_arr.append(True)
    else:
        filter_arr.append(False)
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
The same filter arrays can be created directly from the array, without a loop.
Create a filter array that will return only values higher than 42:
arr = np.array([41, 42, 43, 44])
filter_arr = arr > 42
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Create a filter array that will return only even elements from the original
array:
arr = np.array([1, 2, 3, 4, 5, 6, 7])
filter_arr = arr % 2 == 0
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
Mathematical Functions:
NumPy provides a variety of mathematical functions:
# Sum, mean, and standard deviation
total = np.sum(arr1)
average = np.mean(arr2)
std_dev = np.std(arr1)
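A runnable version of the snippet above might look as follows. The contents of arr1 and arr2 are illustrative assumptions, and the axis argument is added to show how the same functions can reduce along a single dimension.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])          # illustrative 1-D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])   # illustrative 2-D array

total = np.sum(arr1)       # 1 + 2 + 3 + 4 + 5 = 15
average = np.mean(arr2)    # mean of all six values = 3.5
std_dev = np.std(arr1)     # population standard deviation = sqrt(2)

# The axis argument reduces along one dimension only
column_sums = np.sum(arr2, axis=0)   # [5 7 9]
row_means = np.mean(arr2, axis=1)    # [2. 5.]
print(total, average, std_dev, column_sums, row_means)
```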
Pandas Introduction
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The
library was created by Wes McKinney in 2008.
Installation:
You can install Pandas using the following command:
pip install pandas
Importing Pandas:
Once Pandas is installed, import it in your applications by adding the import
keyword:
import pandas
Now Pandas is imported and ready to use.
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
In a Python script or Jupyter notebook, Pandas is usually imported under the pd
alias.
alias: In Python, an alias is an alternate name for referring to the same thing.
import pandas as pd
Now the Pandas package can be referred to as pd instead of pandas.
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
Creating a DataFrame:
The DataFrame is one of the central data structures in Pandas. It is a two-
dimensional table with labeled axes (rows and columns). You can create a
DataFrame from various data sources, such as lists, dictionaries, or external files
(CSV, Excel, SQL, etc.).
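As a minimal sketch, a DataFrame can be built from a dictionary of lists. The names and ages here are hypothetical sample data; the column names Name and Age are chosen to match the snippets in the following sections.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column, each list its values
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
}
df = pd.DataFrame(data)

print(df)
print(df.shape)            # (3, 2): three rows, two columns
print(list(df.columns))    # ['Name', 'Age']
```

Rows get the default integer labels 0, 1, 2, which is why a call such as df.loc[0] selects the first row.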
Basic Operations:
Viewing Data:
# Display the first few rows
print(df.head())
Indexing and Selection:
# Selecting a column
print(df['Name'])
# Selecting rows by index
print(df.loc[0])
# Filtering data
print(df[df['Age'] > 25])
Descriptive Statistics:
# Summary statistics
print(df.describe())
Data Cleaning:
Handling Missing Data:
# Drop rows with missing values
df.dropna()
Removing Duplicates:
# Remove duplicate rows
df.drop_duplicates()
Data Manipulation:
Adding Columns:
# Create a new column
df['Salary'] = [50000, 60000, 45000]
Applying Functions:
# Apply a function to a column
df['Age'] = df['Age'].apply(lambda x: x + 1)
max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the
pd.options.display.max_rows statement.
Example
Check the number of maximum returned rows:
import pandas as pd
print(pd.options.display.max_rows)
In my system the number is 60, which means that if the DataFrame contains
more than 60 rows, the print(df) statement will return only the headers and the
first and last 5 rows.
You can change the maximum rows number with the same statement.
Example
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
The head() method returns the headers and a specified number of rows, starting
from the top.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Example
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting
from the bottom.
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
Info About the Data
The DataFrame object has a method called info(), that gives you more
information about the data set.
Example
Print information about the data:
print(df.info())
Pandas - Cleaning Data
Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
Empty cells
Data in wrong format
Wrong data
Duplicates
In this tutorial you will learn how to deal with all of them.
The data set contains some empty cells ("Date" in row 22, and "Calories" in row
18 and 28).
The data set contains the wrong format ("Date" in row 26).
Example
Remove all rows with NULL values
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
Note: Now, the dropna(inplace = True) will NOT return a new DataFrame, but
it will remove all rows containing NULL values from the original DataFrame.
Example
Replace NULL values with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace = True)
Replace Only For Specified Columns
The example above replaces all empty cells in the whole Data Frame.
To only replace empty values for one column, specify the column name for the
DataFrame:
Example
Replace NULL values in the "Calories" column with the number 130:
import pandas as pd
df = pd.read_csv('data.csv')
df["Calories"] = df["Calories"].fillna(130)
Example
Calculate the MEDIAN, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"] = df["Calories"].fillna(x)
Median = the value in the middle, after you have sorted all values ascending.
Example
Calculate the MODE, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df["Calories"] = df["Calories"].fillna(x)
To fix data in the wrong format (such as the "Date" in row 26), you have two
options: remove the rows, or convert all cells in the column into the same format.
Let's try to convert all cells in the 'Date' column into dates.
As you can see from the result, the date in row 26 was fixed, but the empty date
in row 22 got a NaT (Not a Time) value, in other words an empty value. One
way to deal with empty values is simply removing the entire row.
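The conversion described above could be sketched as follows. The three rows are illustrative stand-ins for the data set (one regular date, one date in a different format, one empty cell). Parsing cell by cell with apply() is an assumption made here so that each value may use its own format; it is not necessarily how the original example was written.

```python
import numpy as np
import pandas as pd

# Illustrative rows: a regular date, a differently formatted date, an empty cell
df = pd.DataFrame({
    'Duration': [60, 60, 45],
    'Date': ['2020/12/25', '20201226', np.nan],
})

# Parse each cell on its own, so mixed formats are accepted
df['Date'] = df['Date'].apply(pd.to_datetime)
print(df)

# The empty cell became NaT; remove the rows that have no date
cleaned = df.dropna(subset=['Date'])
print(cleaned)
```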
Sometimes you can spot wrong data by looking at the data set, because you
have an expectation of what it should be.
If you take a look at our data set, you can see that in row 7, the duration is 450,
but for all the other rows the duration is between 30 and 60.
It doesn't have to be wrong, but taking into consideration that this is the data
set of someone's workout sessions, we can conclude that this person did not
work out for 450 minutes.
Replacing Values
One way to fix wrong values is to replace them with something else.
In our example, it is most likely a typo, and the value should be "45" instead of
"450", and we could just insert "45" in row 7:
Set "Duration" = 45 in row 7:
df.loc[7, 'Duration'] = 45
For small data sets you might be able to replace the wrong data one by one, but
not for big data sets.
To replace wrong data for larger data sets you can create some rules, e.g. set
some boundaries for legal values, and replace any values that are outside of the
boundaries.
Example
Loop through all values in the "Duration" column.
If the value is higher than 120, set it to 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
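For larger data sets the same rule can be applied without an explicit loop. The sketch below uses the Series method clip() on a few illustrative durations; it is an alternative to the loop above, not the original example.

```python
import pandas as pd

# Illustrative durations, including the suspicious value 450
df = pd.DataFrame({'Duration': [60, 45, 450, 30]})

# Cap every value above 120 at 120 in one vectorized step
df['Duration'] = df['Duration'].clip(upper=120)
print(df['Duration'].tolist())   # [60, 45, 120, 30]
```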
Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong
data.
This way you do not have to find out what to replace them with, and there is a
good chance you do not need them to do your analyses.
Example
Delete rows where "Duration" is higher than 120:
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace = True)
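The same removal can also be written with a boolean mask instead of a loop. This is a sketch on illustrative data; negating the condition keeps exactly the rows the loop would keep.

```python
import pandas as pd

# Illustrative durations, including the suspicious value 450
df = pd.DataFrame({'Duration': [60, 45, 450, 30]})

# Keep only the rows where "Duration" is NOT higher than 120
kept = df[~(df['Duration'] > 120)]
print(kept.index.tolist())       # [0, 1, 3]
```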
Discovering Duplicates
Duplicate rows are rows that have been registered more than one time.
    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'    100       120     300.0
18        45  '2020/12/18'     90       112       NaN
19        60  '2020/12/19'    103       123     323.0
20        45  '2020/12/20'     97       125     243.0
21        60  '2020/12/21'    108       131     364.2
22        45           NaN    100       119     282.0
23        60  '2020/12/23'    130       101     300.0
24        45  '2020/12/24'    105       132     246.0
25        60  '2020/12/25'    102       126     334.5
26        60      20201226    100       120     250.0
27        60  '2020/12/27'     92       118     241.0
28        60  '2020/12/28'    103       132       NaN
29        60  '2020/12/29'    100       132     280.0
30        60  '2020/12/30'    102       129     380.3
31        60  '2020/12/31'     92       115     243.0
By taking a look at our test data set, we can assume that rows 11 and 12 are
duplicates.
Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
Example
Remove all duplicates:
df.drop_duplicates(inplace = True)
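To check which rows would be removed before dropping them, the duplicated() method can be used. The sketch below works on four illustrative rows, of which rows 1 and 2 are identical.

```python
import pandas as pd

# Four illustrative rows; rows 1 and 2 are identical
df = pd.DataFrame({
    'Duration': [60, 60, 60, 45],
    'Pulse': [103, 100, 100, 90],
})

# duplicated() marks a row True when an identical row appeared earlier
flags = df.duplicated()
print(flags.tolist())            # [False, False, True, False]

# drop_duplicates() keeps the first occurrence of each row
deduped = df.drop_duplicates()
print(deduped.index.tolist())    # [0, 1, 3]
```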
Data Correlations
The corr() method calculates the relationship between each pair of columns in
your data set.
Result Explained
The result of the corr() method is a table with a lot of numbers that represent
how strong the relationship is between two columns.
The number varies from -1 to 1, where 1 means a perfect relationship.
0.9 is also a good relationship, and if you increase one value, the other will
probably increase as well.
-0.9 would be just as good a relationship as 0.9, but if you increase one value,
the other will probably go down.
0.2 means NOT a good relationship, meaning that if one value goes up, it does
not mean that the other will.
What is a good correlation? It depends on the use, but I think it is safe to say
you have to have at least 0.6 (or -0.6) to call it a good correlation.
Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which
makes sense, each column always has a perfect relationship with itself.
Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more calories
you burn, and the other way around: if you burned a lot of calories, you
probably had a long work out.
Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad
correlation, meaning that we can not predict the max pulse by just looking at the
duration of the work out, and vice versa.
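A small sketch of how such a table is produced is shown below. The numbers are illustrative, not the tutorial's data set: calories are chosen to grow with duration, while max pulse is unrelated, so the resulting values differ from the figures quoted above.

```python
import pandas as pd

# Illustrative workout data: calories grow with duration, max pulse does not
df = pd.DataFrame({
    'Duration': [30, 45, 60, 75, 90],
    'Calories': [200, 290, 410, 500, 610],
    'Maxpulse': [130, 120, 135, 118, 128],
})

corr = df.corr()
print(corr)

# Each column correlates perfectly with itself
print(corr.loc['Duration', 'Duration'])
# Strong relationship (well above 0.6)
print(corr.loc['Duration', 'Calories'])
# Weak relationship (close to 0)
print(corr.loc['Duration', 'Maxpulse'])
```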
What is Matplotlib?
Matplotlib is a low level graph plotting library in Python that serves as a
visualization utility.
Matplotlib was created by John D. Hunter.
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually
imported under the plt alias:
import matplotlib.pyplot as plt
plt.plot(xpoints, ypoints)
plt.show()
Example
Draw two points in the diagram, one at position (1, 3) and one in position (8,
10):
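One way this example could be written is sketched below, using the format string 'o' so that only the two points are drawn and no connecting line. The variable names follow the other snippets in this section.

```python
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 8])
ypoints = np.array([3, 10])

# 'o' draws a circle marker at each point and no line between them
lines = plt.plot(xpoints, ypoints, 'o')
```

Calling plt.show() afterwards displays the diagram; it is left out of the sketch so that it also runs where no display is available.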
Multiple Points
You can plot as many points as you like, just make sure you have the same
number of points on both axes.
Example
Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and finally
to position (8, 10):
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])
plt.plot(xpoints, ypoints)
plt.show()
Default X-Points
If we do not specify the points on the x-axis, they will get the default values 0,
1, 2, 3 etc., depending on the length of the y-points.
So, if we take the same example as above, and leave out the x-points, the
diagram will look like this:
Example
Plotting without x-points:
plt.plot(ypoints)
plt.show()
Matplotlib Markers
Markers
You can use the keyword argument marker to emphasize each point with a
specified marker:
Marker Description
'o' Circle
'*' Star
'.' Point
',' Pixel
'x' X
'X' X (filled)
'+' Plus
'P' Plus (filled)
's' Square
'D' Diamond
'd' Diamond (thin)
'p' Pentagon
'H' Hexagon
'h' Hexagon
'v' Triangle Down
'^' Triangle Up
'<' Triangle Left
'>' Triangle Right
'1' Tri Down
'2' Tri Up
'3' Tri Left
'4' Tri Right
'|' Vline
'_' Hline
You can also use the shortcut string notation parameter to specify the marker.
This parameter is also called fmt, and is written with this syntax:
marker|line|color
Example
Mark each point with a circle:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, 'o:r')
plt.show()
The marker value can be anything from the Marker Reference above.
Note: If you leave out the line value in the fmt parameter, no line will be
plotted.
The short color value can be one of the following:
Color Reference
Color Syntax Description
'r' Red
'g' Green
'b' Blue
'c' Cyan
'm' Magenta
'y' Yellow
'k' Black
'w' White
Marker Size
You can use the keyword argument markersize or the shorter version, ms to set
the size of the markers:
Example
Set the size of the markers to 20:
Marker Color
You can use the keyword argument markeredgecolor or the shorter mec to set
the color of the edge of the markers:
Example
Set the EDGE color to red:
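You can use the keyword argument markerfacecolor or the shorter mfc to set the
color inside the edge of the markers.
Use both the mec and mfc arguments to color the entire marker: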
Example
Set the color of both the edge and the face to red:
Example
Mark each point with the color named "hotpink":
...
plt.plot(ypoints, marker = 'o', ms = 20, mec = 'hotpink', mfc = 'hotpink')
...
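The marker size and color arguments from this section can be combined in a single call. The following is a minimal sketch; the y-values are the ones used in the other examples of this section.

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

# ms sets the marker size, mec the edge color and mfc the face color
lines = plt.plot(ypoints, marker='o', ms=20, mec='r', mfc='r')
```

Calling plt.show() afterwards displays the diagram.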
Matplotlib Line
Linestyle
You can use the keyword argument linestyle, or shorter ls, to change the style of
the plotted line:
Use a dotted line:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, linestyle = 'dotted')
plt.show()
Shorter Syntax
The line style can be written in a shorter syntax:
linestyle can be written as ls.
dotted can be written as :.
dashed can be written as --.
Example
Shorter syntax:
plt.plot(ypoints, ls = ':')
Line Styles
You can choose any of these styles:
Style               Or
'solid' (default)   '-'
'dotted'            ':'
'dashed'            '--'
'dashdot'           '-.'
'None'              '' or ' '
Line Color
You can use the keyword argument color or the shorter c to set the color of the
line:
Example
Set the line color to red:
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10])
plt.plot(ypoints, color = 'r')
plt.show()
Line Width
You can use the keyword argument linewidth or the shorter lw to change the
width of the line.
Example
Plot with a 20.5pt wide line:
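A sketch of this example, reusing the y-values from the line color example above:

```python
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

# linewidth (or lw) is given in points
lines = plt.plot(ypoints, linewidth=20.5)
```

Calling plt.show() afterwards displays the diagram.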
Example
Draw two lines by specifying a plt.plot() function for each line:
import matplotlib.pyplot as plt
import numpy as np
y1 = np.array([3, 8, 1, 10])
y2 = np.array([6, 2, 7, 11])
plt.plot(y1)
plt.plot(y2)
plt.show()
You can also plot many lines by adding the points for the x- and y-axis for each
line in the same plt.plot() function.
(In the examples above we only specified the points on the y-axis, meaning that
the points on the x-axis got the default values (0, 1, 2, 3).)
The x- and y-values come in pairs:
Example
Draw two lines by specifying the x- and y-point values for both lines:
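A sketch of such a call is shown below. The y-values are taken from the two-line example above; the x-values are written out explicitly as an assumption, matching the defaults 0 to 3.

```python
import matplotlib.pyplot as plt
import numpy as np

x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])

# Each x/y pair in the argument list becomes its own line
lines = plt.plot(x1, y1, x2, y2)
```

Calling plt.show() afterwards displays both lines in the same diagram.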
Example
Add a plot title and labels for the x- and y-axis:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.plot(x, y)
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.show()
Example
Set font properties for the title and labels:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
font1 = {'family':'serif','color':'blue','size':20}
font2 = {'family':'serif','color':'darkred','size':15}
plt.title("Sports Watch Data", fontdict = font1)
plt.xlabel("Average Pulse", fontdict = font2)
plt.ylabel("Calorie Burnage", fontdict = font2)
plt.plot(x, y)
plt.show()
Position the Title
You can use the loc parameter in title() to position the title.
Legal values are: 'left', 'right', and 'center'. Default value is 'center'.
Example
Position the title to the left:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data", loc = 'left')
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.show()
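Add Grid Lines to a Plot
With Pyplot, you can use the grid() function to add grid lines to the plot.
import numpy as np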
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()
You can use the axis parameter in the grid() function to specify which grid lines
to display.
Legal values are: 'x', 'y', and 'both'. Default value is 'both'.
Example
Display only grid lines for the x-axis:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid(axis = 'x')
plt.show()
Example
Set the line properties of the grid:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid(color = 'green', linestyle = '--', linewidth = 0.5)
plt.show()
Matplotlib Subplot
Display Multiple Plots
With the subplot() function you can draw multiple plots in one figure:
import matplotlib.pyplot as plt
import numpy as np
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.show()
The subplot() Function
The subplot() function takes three arguments that describe the layout of the
figure.
The layout is organized in rows and columns, which are represented by the first
and second argument.
The third argument represents the index of the current plot.
plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.
plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
So, if we want a figure with 2 rows and 1 column (meaning that the two plots
will be displayed on top of each other instead of side-by-side), we can write the
syntax like this:
Example
Draw 2 plots on top of each other:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 1, 1)
plt.plot(x,y)
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 1, 2)
plt.plot(x,y)
plt.show()
You can draw as many plots as you like on one figure, just describe the number of
rows, columns, and the index of the plot.
Example
Draw 6 plots:
plt.subplot(2, 3, 4)
plt.plot(x,y)
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 3, 5)
plt.plot(x,y)
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 3, 6)
plt.plot(x,y)
plt.show()
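The six-plot layout can also be written compactly with a loop. This is a sketch under the assumption that the plots simply alternate between the two y-series used above; it is not the original example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])
# Two y-series, alternated across the six plots (an illustrative choice)
y_series = [np.array([3, 8, 1, 10]), np.array([10, 20, 30, 40])]

fig = plt.figure()
for index in range(1, 7):
    # 2 rows, 3 columns; the index counts from 1, left to right, top to bottom
    plt.subplot(2, 3, index)
    plt.plot(x, y_series[(index - 1) % 2])
```

Calling plt.show() afterwards displays the figure with all six plots.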
Title
You can add a title to each plot with the title() function:
Example
2 plots, with titles:
import matplotlib.pyplot as plt
import numpy as np
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")
plt.show()
Super Title
You can add a title to the entire figure with the suptitle() function:
Example
Add a title for the entire figure:
#plot 1:
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(1, 2, 1)
plt.plot(x,y)
plt.title("SALES")
#plot 2:
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(1, 2, 2)
plt.plot(x,y)
plt.title("INCOME")
plt.suptitle("MY SHOP")
plt.show()
Matplotlib Scatter
Creating Scatter Plots
With Pyplot, you can use the scatter() function to draw a scatter plot.
The scatter() function plots one dot for each observation. It needs two arrays of
the same length, one for the values of the x-axis, and one for values on the y-
axis:
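x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
plt.show()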
The observation in the example above is the result of 13 cars passing by.
Compare Plots
In the example above, there seems to be a relationship between speed and age,
but what if we plot the observations from another day as well? Will the scatter
plot tell us something else?
Example
Draw two plots on the same figure:
plt.show()
Colors
You can set your own color for each scatter plot with the color or the c
argument:
Example
Set your own color of the markers:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color = 'hotpink')
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y, color = '#88c999')
plt.show()
Example
Set your own color of the markers:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array(["red","green","blue","yellow","pink","black","orange",
                   "purple","beige","brown","gray","cyan","magenta"])
plt.scatter(x, y, c=colors)
plt.show()
ColorMap
The Matplotlib module has a number of available colormaps.
A colormap is like a list of colors, where each color has a value that ranges from
0 to 100.
The colormap used in the examples below is called 'viridis'; it ranges from 0,
which is a purple color, up to 100, which is a yellow color.
You can specify the colormap with the keyword argument cmap, in this case
'viridis'. In addition you have to create an array with values (from 0 to 100),
one value for each point in the scatter plot:
Example
Create a color array, and specify a colormap in the scatter plot:
You can include the colormap in the drawing by including the plt.colorbar()
statement:
Example
Include the actual colormap:
import matplotlib.pyplot as plt
import numpy as np
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
colors = np.array([0, 10, 20, 30, 40, 45, 50, 55, 60, 70, 80, 90, 100])
plt.scatter(x, y, c=colors, cmap='viridis')
plt.colorbar()
plt.show()
Available ColorMaps
You can choose any of the built-in colormaps:
Name Reverse
Accent Accent_r
Blues Blues_r
BrBG BrBG_r
BuGn BuGn_r
BuPu BuPu_r
CMRmap CMRmap_r
Dark2 Dark2_r
GnBu GnBu_r
Greens Greens_r
Greys Greys_r
OrRd OrRd_r
Oranges Oranges_r
PRGn PRGn_r
Paired Paired_r
Pastel1 Pastel1_r
Pastel2 Pastel2_r
PiYG PiYG_r
PuBu PuBu_r
PuBuGn PuBuGn_r
PuOr PuOr_r
PuRd PuRd_r
Purples Purples_r
RdBu RdBu_r
RdGy RdGy_r
RdPu RdPu_r
RdYlBu RdYlBu_r
RdYlGn RdYlGn_r
Reds Reds_r
Set1 Set1_r
Set2 Set2_r
Set3 Set3_r
Spectral Spectral_r
Wistia Wistia_r
YlGn YlGn_r
YlGnBu YlGnBu_r
YlOrBr YlOrBr_r
YlOrRd YlOrRd_r
afmhot afmhot_r
autumn autumn_r
binary binary_r
bone bone_r
brg brg_r
bwr bwr_r
cividis cividis_r
cool cool_r
coolwarm coolwarm_r
copper copper_r
cubehelix cubehelix_r
flag flag_r
gist_earth gist_earth_r
gist_gray gist_gray_r
gist_heat gist_heat_r
gist_ncar gist_ncar_r
gist_rainbow gist_rainbow_r
gist_stern gist_stern_r
gist_yarg gist_yarg_r
gnuplot gnuplot_r
gnuplot2 gnuplot2_r
gray gray_r
hot hot_r
hsv hsv_r
inferno inferno_r
jet jet_r
magma magma_r
nipy_spectral nipy_spectral_r
ocean ocean_r
pink pink_r
plasma plasma_r
prism prism_r
rainbow rainbow_r
seismic seismic_r
spring spring_r
summer summer_r
tab10 tab10_r
tab20 tab20_r
tab20b tab20b_r
tab20c tab20c_r
terrain terrain_r
twilight twilight_r
twilight_shifted twilight_shifted_r
viridis viridis_r
winter winter_r
Size
You can change the size of the dots with the s argument.
Just like colors, make sure the array for sizes has the same length as the arrays
for the x- and y-axis:
Example
Set your own size for the markers:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])
plt.scatter(x, y, s=sizes)
plt.show()
Alpha
You can adjust the transparency of the dots with the alpha argument.
You can combine it with the s argument; as before, make sure the array for sizes
has the same length as the arrays for the x- and y-axis:
Example
Set your own transparency for the markers:
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
sizes = np.array([20,50,100,200,500,1000,60,90,10,300,600,800,75])
plt.scatter(x, y, s=sizes, alpha=0.5)
plt.show()
Example
Create random arrays with 100 values for x-points, y-points, colors and sizes:
x = np.random.randint(100, size=(100))
y = np.random.randint(100, size=(100))
colors = np.random.randint(100, size=(100))
sizes = 10 * np.random.randint(100, size=(100))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.5, cmap='nipy_spectral')
plt.colorbar()
plt.show()
Matplotlib Bars
Creating Bars
With Pyplot, you can use the bar() function to draw bar graphs:
The bar() function takes arguments that describe the layout of the bars.
The categories and their values are represented by the first and second
arguments, as arrays.
Example
import matplotlib.pyplot as plt
x = ["APPLES", "BANANAS"]
y = [400, 350]
plt.bar(x, y)
plt.show()
Horizontal Bars
If you want the bars to be displayed horizontally instead of vertically, use the
barh() function:
Example
Draw 4 horizontal bars:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.barh(x, y)
plt.show()
Bar Color
The bar() and barh() take the keyword argument color to set the color of the
bars:
Example
Draw 4 red bars:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y, color = "red")
plt.show()
Color Names
You can use any of the 140 supported color names.
Example
Draw 4 "hot pink" bars:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y, color = "hotpink")
plt.show()
Color Hex
Or you can use Hexadecimal color values:
Example
Draw 4 bars with a beautiful green color:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y, color = "#4CAF50")
plt.show()
Bar Width
The bar() takes the keyword argument width to set the width of the bars:
Example
Draw 4 very thin bars:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y, width = 0.1)
plt.show()
Bar Height
The barh() takes the keyword argument height to set the height of the bars:
Example
Draw 4 very thin bars:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.barh(x, y, height = 0.1)
plt.show()
Matplotlib Histograms
Histogram
A histogram is a graph showing frequency distributions.
Example: Say you ask for the height of 250 people, you might end up with a
histogram like this:
You can read from the histogram that there are approximately:
2 people from 140 to 145cm
5 people from 145 to 150cm
15 people from 151 to 156cm
31 people from 157 to 162cm
46 people from 163 to 168cm
53 people from 168 to 173cm
45 people from 173 to 178cm
28 people from 179 to 184cm
21 people from 185 to 190cm
4 people from 190 to 195cm
Create Histogram
In Matplotlib, we use the hist() function to create histograms.
The hist() function will use an array of numbers to create a histogram, the array
is sent into the function as an argument.
For simplicity we use NumPy to randomly generate an array with 250 values,
where the values will concentrate around 170, and the standard deviation is 10.
The hist() function will read the array and produce a histogram:
Example
A simple histogram:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.normal(170, 10, 250)
plt.hist(x)
plt.show()
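To read the bin counts as numbers, as in the list of height ranges above, np.histogram() can be used. This is a sketch: it assumes the default of 10 equal-width bins, and it uses a seeded generator (np.random.default_rng, not the call used in the example above) so that repeated runs give the same values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the run is repeatable
x = rng.normal(170, 10, 250)          # 250 values around 170, std. dev. 10

# 10 equal-width bins between the smallest and the largest value
counts, edges = np.histogram(x, bins=10)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{count:3d} values from {left:.1f} to {right:.1f}")
```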
Matplotlib Pie Charts
With Pyplot, you can use the pie() function to draw pie charts.
Labels
Add labels to the pie chart with the labels parameter.
The labels parameter must be an array with one label for each wedge:
Example
A simple pie chart:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels)
plt.show()
Start Angle
By default, the first wedge starts at the x-axis and the wedges are drawn
counterclockwise, but you can change the start angle by specifying a startangle
parameter.
Example
Start the first wedge at 90 degrees:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels, startangle = 90)
plt.show()
Explode
Maybe you want one of the wedges to stand out? The explode parameter allows
you to do that.
The explode parameter, if specified, and not None, must be an array with one
value for each wedge.
Each value represents how far from the center each wedge is displayed:
Example
Pull the "Apples" wedge 0.2 from the center of the pie:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
myexplode = [0.2, 0, 0, 0]
plt.pie(y, labels = mylabels, explode = myexplode)
plt.show()
Shadow
Add a shadow to the pie chart by setting the shadow parameter to True:
Example
Add a shadow:
Colors
You can set the color of each wedge with the colors parameter.
The colors parameter, if specified, must be an array with one value for each
wedge:
Example
Specify a new color for each wedge:
You can use Hexadecimal color values, any of the 140 supported color names,
or one of these shortcuts:
'r' - Red
'g' - Green
'b' - Blue
'c' - Cyan
'm' - Magenta
'y' - Yellow
'k' - Black
'w' - White
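The shadow and colors parameters can be tried together, as in the sketch below. The fruit data is the same as in the other pie examples; the particular colors are an illustrative choice mixing a color name, a shortcut and a hexadecimal value.

```python
import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
# One color per wedge (illustrative choice)
mycolors = ["black", "hotpink", "b", "#4CAF50"]

result = plt.pie(y, labels=mylabels, colors=mycolors, shadow=True)
wedges = result[0]   # the first returned item holds the wedge patches
```

Calling plt.show() afterwards displays the pie chart.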
Legend
To add a list of explanations for each wedge, use the legend() function:
Example
Add a legend:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels)
plt.legend()
plt.show()
Example
Add a legend with a header:
import matplotlib.pyplot as plt
import numpy as np
y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(y, labels = mylabels)
plt.legend(title = "Four Fruits:")
plt.show()