Analysis of Algorithms: Matplotlib and Pandas Dataframe
A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure in the pandas
library, a popular Python library for data manipulation and analysis. It is similar to a spreadsheet or SQL table,
where data is arranged in rows and columns.
Each column in a DataFrame represents a variable or feature, while each row represents a single observation
or data point. DataFrames can hold various types of data, including integers, floats, strings, and even other
Python objects.
Dr Mohammed Al-Hubaishi
Import NumPy and pandas modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language.
Local file system
df.head(10)
https://colab.research.google.com/drive/1gjY9cyEVebn0OkeDL6TkFu3atpYxpS6G?usp=sharing
Examples
Car
https://colab.research.google.com/drive/1dV2TOcyeDA484i1A9qVY8B4jobCndczv
Aloha
https://colab.research.google.com/drive/1WJ3H9DeJcXvT9Sz5Qbut20YAh9jfdrVC
Creating a DataFrame
The following code cell creates a simple DataFrame containing 10 cells organized as follows:
● 5 rows
● 2 columns, one named temperature and the other named activity
The following code cell instantiates the pd.DataFrame class to generate a DataFrame. The class takes two
arguments:
● The first argument provides the data to populate the 10 cells. The code cell calls np.array to generate
the 5x2 NumPy array.
● The second argument identifies the names of the two columns.
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']
# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)
# Print the entire DataFrame
print(my_dataframe)
Adding a new column to a DataFrame
You may add a new column to an existing pandas DataFrame just by assigning values to
a new column name. For example, the following code creates a third column named
adjusted in my_dataframe:
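A minimal sketch of such an assignment, assuming that adjusted is derived by adding 2 to activity (the offset is an illustrative assumption, not a value given in these slides):

```python
import numpy as np
import pandas as pd

# Rebuild the 5x2 DataFrame from the previous section so this cell runs on its own.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
my_dataframe = pd.DataFrame(data=my_data, columns=['temperature', 'activity'])

# Assigning to a column name that does not exist yet creates the column.
# The "+ 2" adjustment is an assumed example value.
my_dataframe['adjusted'] = my_dataframe['activity'] + 2

print(my_dataframe)
```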
Specifying a subset of a DataFrame
pandas provides multiple ways to isolate specific rows, columns, slices or cells in
a DataFrame.
print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')
print("Column 'temperature':")
print(my_dataframe['temperature'])
Viewing the Data
► One of the most used methods for getting a quick overview of a DataFrame is the head() method.
► The head() method returns the headers and a specified number of rows, starting from the top.
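A short sketch of head(), assuming a small made-up DataFrame (the column names n and square are hypothetical):

```python
import pandas as pd

# Hypothetical 12-row DataFrame, used only for illustration.
df = pd.DataFrame({"n": range(1, 13), "square": [n * n for n in range(1, 13)]})

# Ask for the first 10 rows explicitly.
first_rows = df.head(10)
print(first_rows)

# Without an argument, head() returns the first 5 rows.
print(df.head())
```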
Show the last rows of the DataFrame
► There is also a tail() method for viewing the last rows of the DataFrame.
► The tail() method returns the headers and a specified number of rows, starting from the bottom.
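The same idea for tail(), again on a small made-up DataFrame (column names are hypothetical):

```python
import pandas as pd

# Hypothetical 12-row DataFrame, used only for illustration.
df = pd.DataFrame({"n": range(1, 13), "square": [n * n for n in range(1, 13)]})

# Return the last 3 rows; without an argument, tail() returns the last 5.
last_rows = df.tail(3)
print(last_rows)
```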
Info About the Data
► The DataFrame object has a method called info() that gives you more information about the data set.
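A minimal sketch of calling info(), assuming a tiny made-up workout table with one empty cell:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing "Calories" value.
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, np.nan, 195.0],
})

# Prints the number of rows, each column's name, its non-null count and its dtype.
df.info()
```

The non-null count is the part to look at when hunting for empty cells: here "Calories" reports fewer non-null values than there are rows.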
Result Explained
Cleaning Data
Data Set example
The data set contains some empty cells ("Date" in row 22, and
"Calories" in rows 18 and 28).
Cleaning Empty Cells
Remove Rows
► One way to deal with empty cells is to remove rows that contain empty
cells.
► This is usually OK, since data sets can be very big, and removing a few
rows will not have a big impact on the result.
Note: By default, the dropna() method returns a new DataFrame, and will not change the
original.
Remove all rows with NULL values:
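A sketch of both variants of dropna(), assuming a small made-up DataFrame in place of the workout data set:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two empty "Calories" cells.
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [409.1, np.nan, 195.0, np.nan],
})

# Default behaviour: a new DataFrame is returned, the original is untouched.
new_df = df.dropna()
rows_before = len(df)

# With inplace=True the original DataFrame itself is changed.
df.dropna(inplace=True)

print(new_df)
print(df)
```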
Replace Empty Values
► Another way of dealing with empty cells is to insert a new value instead.
► This way you do not have to delete entire rows just because of some empty cells.
► The fillna() method allows us to replace empty cells with a value:
Replace NULL values with the number 130:
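A minimal sketch, assuming a small made-up DataFrame in place of the workout data set:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell.
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, np.nan, 195.0],
})

# fillna() returns a new DataFrame in which every empty cell holds 130.
filled = df.fillna(130)
print(filled)
```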
Replace Only For Specified Columns
► The example above replaces all empty cells in the whole DataFrame.
► To replace empty values in only one column, specify the column name for the DataFrame:
Replace NULL values in the "Calories" column with the number 130:
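A minimal sketch, assuming a made-up DataFrame with empty cells in two columns so that the effect on a single column is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical data: both columns contain an empty cell.
df = pd.DataFrame({
    "Pulse": [110.0, np.nan, 103.0],
    "Calories": [409.1, np.nan, 195.0],
})

# Only the "Calories" column is filled; "Pulse" keeps its empty cell.
df["Calories"] = df["Calories"].fillna(130)
print(df)
```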
Replace Using Mean, Median, or Mode
► A common way to replace empty cells is to calculate the mean, median or mode value of the column.
► pandas provides the mean(), median() and mode() methods to calculate the respective values for a specified column:
Mean = the average value (the sum of all values divided by number of values).
Median = the value in the middle, after you have sorted all values ascending.
Mode = the value that appears most frequently.
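The three definitions can be sketched on a made-up "Calories" column (the values are chosen only so that the three results differ):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one empty cell.
df = pd.DataFrame({"Calories": [300.0, 300.0, 400.0, np.nan, 500.0]})

# Empty cells are skipped when these statistics are computed.
mean_value = df["Calories"].mean()
median_value = df["Calories"].median()
# mode() returns a Series because several values can tie; take the first.
mode_value = df["Calories"].mode()[0]

# Fill the empty cell with the mean (median_value or mode_value work the same way).
df["Calories"] = df["Calories"].fillna(mean_value)
print(df)
```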
Convert Into a Correct Format
Removing Rows
► Converting the "Date" column to a date format gives a NaT (Not a Time) value for the empty date. NaT can be
handled as a NULL value, so we can remove the row by using the dropna() method.
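A sketch of the conversion followed by the row removal, assuming a small made-up table and the date format YYYY/MM/DD:

```python
import pandas as pd

# Hypothetical data: the third row has no date.
df = pd.DataFrame({
    "Date": ["2020/12/20", "2020/12/21", None, "2020/12/23"],
    "Duration": [60, 45, 45, 60],
})

# Convert the text column to datetime; the empty cell becomes NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")
nat_count = int(df["Date"].isna().sum())

# Remove only the rows whose "Date" is NaT.
df = df.dropna(subset=["Date"])
print(df)
```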
Fixing Wrong Data
► If you take a look at our data set, you can see that in row 7,
the duration is 450, but for all the other rows the duration is
between 30 and 60.
► It doesn't have to be wrong, but taking into consideration that
this is the data set of someone's workout sessions, we
conclude that this person did not work out for
450 minutes.
► How can we fix wrong values, like the one for
"Duration" in row 7?
► To replace wrong data for larger data sets you can create
some rules, e.g. set some boundaries for legal values,
and replace any values that are outside of the
boundaries.
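A minimal sketch of such a boundary rule, assuming an upper limit of 120 minutes (the limit and the data are illustrative assumptions):

```python
import pandas as pd

# Hypothetical durations; 450 is the suspicious value.
df = pd.DataFrame({"Duration": [60, 45, 450, 30]})

# Assumed rule: no workout is longer than 120 minutes.
upper_limit = 120

# Select the rows that break the rule and overwrite their "Duration".
df.loc[df["Duration"] > upper_limit, "Duration"] = upper_limit
print(df)
```

Instead of capping the value, the offending rows could also be dropped, for example with df[df["Duration"] <= upper_limit].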
Removing Duplicates
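One possible way to find and remove duplicate rows, sketched with pandas' duplicated() and drop_duplicates() methods on a made-up table:

```python
import pandas as pd

# Hypothetical data: the third row repeats the second.
df = pd.DataFrame({
    "Duration": [60, 45, 45, 30],
    "Pulse": [110, 117, 117, 103],
})

# duplicated() marks every row that repeats an earlier row with True.
is_duplicate = df.duplicated()
print(is_duplicate)

# drop_duplicates() returns a DataFrame without the repeated rows.
df = df.drop_duplicates()
print(df)
```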
25
What is Matplotlib?
► Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
► Matplotlib was created by John D. Hunter.
► Matplotlib is open source and we can use it freely.
► Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript for
platform compatibility.
► If you have Python and pip already installed on a system, then installation of Matplotlib is very easy.
► Install it using this command: pip install matplotlib
► If you do not have Python, you can log in with your Google account and use Google Colab:
► https://colab.research.google.com/
► Once Matplotlib is installed, import it in your applications by adding the import statement: import matplotlib
► Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported under the plt
alias: import matplotlib.pyplot as plt
The plot() function is used to draw points (markers) in a diagram. By default, the plot() function draws a line from point to point.
The function takes parameters for specifying points in the diagram.
Parameter 1 is an array containing the points on the x-axis.
Parameter 2 is an array containing the points on the y-axis.
If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3, 10] to the plot function.
Example: Draw a line in a diagram from position (1, 3) to position (8, 10):
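A minimal sketch of this example (the variable names xpoints and ypoints are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# x-coordinates and y-coordinates of the two end points.
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])

# plot() returns a list of line objects; here it contains a single line.
line, = plt.plot(xpoints, ypoints)
plt.show()
```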
► To plot only the markers, you can use shortcut string notation parameter 'o', which means 'rings'.
► You can plot as many points as you like, just make sure you have the same number of points on both axes.
► Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and finally to position (8, 10):
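Both ideas can be sketched with the same four points, first as markers only and then as a connected line (variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Four points: (1, 3), (2, 8), (6, 1) and (8, 10).
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

# The format string 'o' draws a ring at each point and no connecting line.
markers_only, = plt.plot(xpoints, ypoints, 'o')
plt.show()

# Without a format string the points are joined by a line, in order.
connected, = plt.plot(xpoints, ypoints)
plt.show()
```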
► If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3, etc., depending on
the length of the y-points. So, if we take the same example as above, and leave out the x-points, the
diagram will look like this: Plotting without x-points:
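A short sketch of plotting with y-points only:

```python
import numpy as np
import matplotlib.pyplot as plt

# Only y-values are given; Matplotlib supplies x = 0, 1, 2, 3.
ypoints = np.array([3, 8, 1, 10])

line, = plt.plot(ypoints)
plt.show()
```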
► You can use the keyword argument marker to emphasize each point with a specified marker:
► You can use the keyword argument markeredgecolor or the shorter mec to set the color of the edge of the markers:
► You can use the keyword argument markerfacecolor or the shorter mfc to set the color inside the edge of the markers:
► Use both the mec and mfc arguments to color the entire marker:
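A sketch combining the three marker arguments (the colors 'r' and 'b' are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# marker='o' draws a circle at each point,
# mec sets the edge color and mfc the fill color of the markers.
line, = plt.plot(ypoints, marker='o', mec='r', mfc='b')
plt.show()
```

Passing the same color to mec and mfc colors the entire marker.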
► Linestyle: You can use the keyword argument linestyle, or the shorter ls, to change the style of the
plotted line:
► You can use the keyword argument color or the shorter c to set the color of the line:
► You can use the keyword argument linewidth or the shorter lw to change the width of the line.
► The value is a floating-point number, in points:
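The three line arguments can be combined in one call; the values below are arbitrary examples:

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# Dotted red line, 2.5 points wide.
# Short forms of the same call: ls='dotted', c='r', lw=2.5.
line, = plt.plot(ypoints, linestyle='dotted', color='r', linewidth=2.5)
plt.show()
```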
► You can plot as many lines as you like by simply adding more plt.plot() functions:
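A minimal sketch with two lines drawn in the same diagram (the data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt

y1 = np.array([3, 8, 1, 10])
y2 = np.array([6, 2, 7, 11])

# Each call adds one more line to the current diagram.
first, = plt.plot(y1)
second, = plt.plot(y2)
plt.show()
```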
► In the example above, there seems to be a relationship between speed and age, but what if we plot the
observations from another day as well? Will the scatter plot tell us something else?
► You can even set a specific color for each dot by using an array of colors as value for the c argument:
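A sketch of both scatter plots, assuming the scatter() function of pyplot and made-up car observations (age in years, speed in km/h; all values are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observations from two different days.
age_day1 = np.array([5, 7, 8, 7, 2])
speed_day1 = np.array([99, 86, 87, 88, 111])
age_day2 = np.array([2, 2, 8, 1, 15])
speed_day2 = np.array([100, 105, 84, 105, 90])

# Two scatter() calls draw both days in the same diagram, each in its own color.
plt.scatter(age_day1, speed_day1)
plt.scatter(age_day2, speed_day2)
plt.show()

# One color per dot: the list must be as long as the data arrays.
colors = ["red", "green", "blue", "orange", "purple"]
points = plt.scatter(age_day1, speed_day1, c=colors)
plt.show()
```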
Creating Bars
With Pyplot, you can use the bar() function to draw bar graphs:
► The bar() takes the keyword argument width to set the width of the bars:
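A minimal sketch of a bar graph with narrow bars (the categories and values are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical categories and their values.
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

# width=0.1 makes the bars much thinner than the default of 0.8.
bars = plt.bar(x, y, width=0.1)
plt.show()
```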
► With Pyplot, you can use the pie() function to draw pie charts:
► As mentioned, the default start angle is at the x-axis, but you can change the start angle by specifying the
startangle parameter.
► The startangle parameter is defined as an angle in degrees; the default angle is 0:
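A short sketch of a pie chart that starts at 90 degrees instead of 0 (the values are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values; each one becomes a wedge sized by its share of the total.
y = np.array([35, 25, 25, 15])

# startangle=90 rotates the start of the first wedge from the x-axis to the top.
plt.pie(y, startangle=90)
plt.show()
```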
https://docs.omnetpp.org/tutorials/pandas/
The CSV file has a fixed number of columns named run, type, module, name, value, etc. Each result item,
i.e. scalar, statistic, histogram and vector, produces one row of output in the CSV.
Other items such as run attributes, iteration variables of the parameter study and result attributes also
generate their own rows.
The content of the type column determines what type of information a given row contains. The type column
also determines which other columns are in use.
For example, the binedges and binvalues columns are only filled in for histogram items. The full list of columns
is given in the tutorial linked above.
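A minimal sketch of how the type column can be used to select rows with pandas. The CSV text below is a tiny hypothetical stand-in for a real result file: it uses only the five columns named above, and the run names, module path and result names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical excerpt of an exported result file (not real simulation output).
csv_text = """run,type,module,name,value
run-1,scalar,Aloha.server,duration,5400
run-1,scalar,Aloha.server,channelUtilization,0.15
run-1,histogram,Aloha.server,collisionLength,
run-2,scalar,Aloha.server,channelUtilization,0.18
"""

results = pd.read_csv(io.StringIO(csv_text))

# Keep only the rows that describe scalar results.
scalars = results[results["type"] == "scalar"]

# Narrow down further to one named scalar across all runs.
utilization = scalars[scalars["name"] == "channelUtilization"]
print(utilization[["run", "value"]])
```

With a real file, pd.read_csv would be given the file path instead of the in-memory text.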
Data frame from Aloha
Aloha scenarios
https://docs.omnetpp.org/tutorials/pandas/
Results
Aloha - folder
References
► https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb#scrollTo=NO_4WhdJZ5Pm
► https://www.w3schools.com/python/pandas/pandas_dataframes.asp
► https://www.w3schools.com/python/pandas/pandas_cleaning.asp
► https://www.w3schools.com/python/pandas/pandas_analyzing.asp