Analysis of Algorithms: Matplotlib and Pandas Dataframe

The document provides an overview of Pandas DataFrame and Matplotlib for data manipulation and visualization in Python. It explains how to create, modify, and clean DataFrames, as well as how to plot data using Matplotlib's various functions. Key topics include adding columns, handling missing data, and creating different types of plots such as line, scatter, bar, and pie charts.

Analysis of Algorithms
Matplotlib and Pandas DataFrame
By Dr. Mohammed Al-Hubaishi

Pandas DataFrame

A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure in the pandas
library, a popular Python library for data manipulation and analysis. It is similar to a spreadsheet or SQL table,
where data is arranged in rows and columns.

Each column in a DataFrame represents a variable or feature, while each row represents a single observation
or data point. DataFrames can hold various types of data, including integers, floats, strings, and even other
Python objects.

● A DataFrame stores data in cells.


● A DataFrame has named columns (usually) and numbered rows.

Import NumPy and pandas modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

NumPy offers comprehensive mathematical functions, random number generators,
linear algebra routines, Fourier transforms, and more.

pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language.

Local file system

► Uploading files from your local file system

files.upload() returns a dictionary of the uploaded files. The dictionary is keyed by file name, and the values are
the uploaded data.

from google.colab import files

# Open the Colab upload dialog and keep the returned dictionary.
uploaded = files.upload()
# Read the uploaded CSV file into a DataFrame named df.
df = pd.read_csv('Rdataset.csv')
# Show the first 10 rows.
df.head(10)

https://colab.research.google.com/drive/1gjY9cyEVebn0OkeDL6TkFu3atpYxpS6G?usp=sharing

Examples

Car
https://colab.research.google.com/drive/1dV2TOcyeDA484i1A9qVY8B4jobCndczv
Aloha
https://colab.research.google.com/drive/1WJ3H9DeJcXvT9Sz5Qbut20YAh9jfdrVC

Creating a DataFrame

The following code cell creates a simple DataFrame containing 10 cells organized as follows:
● 5 rows
● 2 columns, one named temperature and the other named activity
The code cell instantiates the pd.DataFrame class to generate a DataFrame. The class takes two
arguments:

● The first argument provides the data to populate the 10 cells. The code cell calls np.array to generate
the 5x2 NumPy array.
● The second argument identifies the names of the two columns.
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])
# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']
# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)
# Print the entire DataFrame.
print(my_dataframe)

Adding a new column to a DataFrame

You may add a new column to an existing pandas DataFrame just by assigning values to
a new column name. For example, the following code creates a third column named
adjusted in my_dataframe:

# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame.
print(my_dataframe)

Specifying a subset of a DataFrame

pandas provides multiple ways to isolate specific rows, columns, slices, or cells in
a DataFrame.

print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')

print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])

Viewing the Data

► One of the most used methods for getting a quick overview of a DataFrame is the head() method.
► The head() method returns the headers and a specified number of rows, starting from the top.
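
A minimal sketch of head(), assuming a small made-up DataFrame (the column names and values are hypothetical, not the data set used in the slides):

```python
import pandas as pd

# Hypothetical data: 20 rows of made-up workout readings.
df = pd.DataFrame({"Duration": [60] * 20, "Pulse": list(range(100, 120))})

first_rows = df.head(10)   # the first 10 rows
default_rows = df.head()   # with no argument, head() returns 5 rows

print(first_rows)
```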

Show the Last Rows of the DataFrame

► There is also a tail() method for viewing the last rows of the DataFrame.
► The tail() method returns the headers and a specified number of rows, starting from the bottom.
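
As a sketch, tail() can be tried on the same kind of made-up data (the Pulse column is a hypothetical example):

```python
import pandas as pd

# Hypothetical data: 20 made-up pulse readings.
df = pd.DataFrame({"Pulse": list(range(100, 120))})

last_rows = df.tail(3)   # the last 3 rows, original index labels kept
print(last_rows)
```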

Info About the Data

► The DataFrame object has a method called info() that gives you more information about the data set.
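
A short sketch of info(), assuming a hypothetical three-row DataFrame with one missing value. info() prints a summary (column names, non-null counts, dtypes); count() is used here only to get the non-null counts as values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty "Calories" cell.
df = pd.DataFrame({"Duration": [60, 45, 30],
                   "Calories": [409.1, np.nan, 195.0]})

df.info()                      # prints the summary to the console
non_null_counts = df.count()   # non-null values per column
print(non_null_counts)
```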

Result Explained

Cleaning Data

Data Set example

The data set contains some empty cells ("Date" in row 22, and
"Calories" in rows 18 and 28).

The data set contains a wrong format ("Date" in row 26).

The data set contains wrong data ("Duration" in row 7).

The data set contains duplicates (rows 11 and 12).

Cleaning Empty Cells
Remove Rows
► One way to deal with empty cells is to remove rows that contain empty
cells.
► This is usually OK, since data sets can be very big, and removing a few
rows will not have a big impact on the result.

Note: By default, the dropna() method returns a new DataFrame, and will not change the
original.
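
A minimal sketch of dropna(), assuming a hypothetical DataFrame with two empty cells. It shows that the original DataFrame is left unchanged:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 3 have an empty "Calories" cell.
df = pd.DataFrame({"Duration": [60, 45, 30, 60],
                   "Calories": [409.1, np.nan, 195.0, np.nan]})

new_df = df.dropna()   # new DataFrame without the rows that contain NaN

print(new_df)
print(len(df))         # the original still has all its rows
```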
Remove all rows with NULL values:

► If you want to change the original DataFrame, use the inplace = True argument:
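
As a sketch on the same kind of hypothetical data, inplace=True modifies df itself and returns nothing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 3 have an empty "Calories" cell.
df = pd.DataFrame({"Duration": [60, 45, 30, 60],
                   "Calories": [409.1, np.nan, 195.0, np.nan]})

df.dropna(inplace=True)   # removes the rows from df directly
print(df)
```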

Replace Empty Values

► Another way of dealing with empty cells is to insert a new value instead.
► This way you do not have to delete entire rows just because of some empty cells.
► The fillna() method allows us to replace empty cells with a value:
Replace NULL values with the number 130:
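
A minimal sketch of fillna(), assuming a hypothetical DataFrame. Every empty cell, in any column, is replaced with 130:

```python
import numpy as np
import pandas as pd

# Hypothetical data with empty cells in both columns of row 1.
df = pd.DataFrame({"Duration": [60, np.nan, 30],
                   "Calories": [409.1, np.nan, 195.0]})

filled = df.fillna(130)   # returns a new DataFrame with NaN replaced by 130
print(filled)
```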

Replace Only For Specified Columns

► The example above replaces all empty cells in the whole DataFrame.
► To replace empty values in only one column, specify the column name for the DataFrame:

Replace NULL values in the "Calories" column with the number 130:
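
One way to sketch this, assuming a hypothetical DataFrame, is to fill the single column and assign the result back. Assigning back is used here instead of inplace=True on the column, which recent pandas versions discourage:

```python
import numpy as np
import pandas as pd

# Hypothetical data with empty cells in both columns of row 1.
df = pd.DataFrame({"Duration": [60, np.nan, 30],
                   "Calories": [409.1, np.nan, 195.0]})

# Fill only the "Calories" column; "Duration" keeps its empty cell.
df["Calories"] = df["Calories"].fillna(130)
print(df)
```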

Replace Using Mean, Median, or Mode

► A common way to replace empty cells is to calculate the mean, median, or mode value of the column.
► pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column:

Mean = the average value (the sum of all values divided by number of values).
Median = the value in the middle, after you have sorted all values ascending.
Mode = the value that appears most frequently.
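
A small sketch of the three statistics, assuming a hypothetical "Calories" column. Note that mode() returns a Series (there can be several modes), so the first entry is taken:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell.
df = pd.DataFrame({"Calories": [300.0, 300.0, np.nan, 400.0, 500.0]})

mean_value = df["Calories"].mean()       # (300 + 300 + 400 + 500) / 4
median_value = df["Calories"].median()   # middle of 300, 300, 400, 500
mode_value = df["Calories"].mode()[0]    # most frequent value

# Replace the empty cell with the mean (median or mode work the same way).
df["Calories"] = df["Calories"].fillna(mean_value)
print(df)
```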
Convert Into a Correct Format

In our DataFrame, we have two cells with the wrong format. Check out rows 22 and 26: the 'Date'
column should be a string that represents a date:
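
A hedged sketch of one way to convert such a column with pd.to_datetime(). The data and the two date formats are assumptions for illustration; parsing each format explicitly with errors="coerce" is a choice made here so the example behaves the same across pandas versions:

```python
import pandas as pd

# Hypothetical data: one well-formed date, one in a different format, one empty.
df = pd.DataFrame({"Date": ["2020/12/25", "20201226", None],
                   "Pulse": [110, 100, 130]})

# Parse the expected format; anything that does not match becomes NaT.
parsed = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")
# Parse the alternative format, then use it where the first pass failed.
fallback = pd.to_datetime(df["Date"], format="%Y%m%d", errors="coerce")
df["Date"] = parsed.fillna(fallback)

print(df)   # the empty cell is shown as NaT
```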

Removing Rows

► The conversion in the example above produced a NaT value, which can be handled as a
NULL value, and we can remove the row by using the dropna() method.
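
A minimal sketch, assuming a hypothetical DataFrame whose "Date" column already contains a NaT. The subset argument restricts the check to that column:

```python
import pandas as pd

# Hypothetical data: the second date is missing (NaT).
df = pd.DataFrame({"Date": pd.to_datetime(["2020-12-21", None, "2020-12-23"]),
                   "Pulse": [108, 100, 130]})

# Drop only the rows whose "Date" is empty.
df = df.dropna(subset=["Date"])
print(df)
```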

Fixing Wrong Data

► If you take a look at our data set, you can see that in row 7,
the duration is 450, but for all the other rows the duration is
between 30 and 60.
► It doesn't have to be wrong, but taking into consideration that
this is the data set of someone's workout sessions, we
conclude that this person did not work out for
450 minutes.
► How can we fix wrong values, like the one for
"Duration" in row 7?

To replace wrong data for larger data sets you can create
some rules, e.g. set some boundaries for legal values,
and replace any values that are outside of the
boundaries.
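
Both approaches can be sketched as follows. The data, the replacement value 45, and the boundary of 120 minutes are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical data: row 2 holds a suspicious duration of 450 minutes.
df = pd.DataFrame({"Duration": [60, 45, 450, 30]})

# Option 1: fix a single known cell by row label and column name.
fixed_one = df.copy()
fixed_one.loc[2, "Duration"] = 45

# Option 2: apply a rule; any value above the boundary is set to the boundary.
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

print(fixed_one)
print(capped)
```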

Removing Duplicates

► Duplicate rows are rows that have been registered more than one time.
► By taking a look at our test data set, we can assume that
rows 11 and 12 are duplicates.
► To discover duplicates, we can use the duplicated()
method.
► The duplicated() method returns a Boolean value for each
row:
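
A minimal sketch of duplicated(), assuming a hypothetical DataFrame in which row 2 repeats row 1:

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 are identical.
df = pd.DataFrame({"Duration": [60, 45, 45, 30],
                   "Pulse": [110, 109, 109, 100]})

# True marks a row that repeats an earlier row.
dup_mask = df.duplicated()
print(dup_mask)
```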

Removing Duplicates

► To remove duplicates, use the drop_duplicates() method.

Remember: The inplace = True argument makes sure that
the method does NOT return a new DataFrame;
instead, it removes all duplicates from the original
DataFrame.
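
As a sketch on the same kind of hypothetical data, drop_duplicates() keeps the first occurrence of each repeated row:

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 are identical.
df = pd.DataFrame({"Duration": [60, 45, 45, 30],
                   "Pulse": [110, 109, 109, 100]})

df.drop_duplicates(inplace=True)   # row 2 is removed from df itself
print(df)
```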

What is Matplotlib?

► Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
► Matplotlib was created by John D. Hunter.
► Matplotlib is open source and we can use it freely.
► Matplotlib is mostly written in Python; a few segments are written in C, Objective-C, and JavaScript for
platform compatibility.

Installation of Matplotlib

► If you have Python and pip already installed on a system, then installing Matplotlib is very easy.
► Install it using this command:
pip install matplotlib

► If you do not have Python, you can log in with your Google account and use Google Colab:
► https://colab.research.google.com/

Import Matplotlib

► Once Matplotlib is installed, import it in your applications by adding the import statement:
import matplotlib

Matplotlib Pyplot

► Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported under the plt
alias:
import matplotlib.pyplot as plt

Now the pyplot package can be referred to as plt.

Plotting x and y points

The plot() function is used to draw points (markers) in a diagram. By default, the plot() function draws a line from point to point.
The function takes parameters for specifying points in the diagram.
Parameter 1 is an array containing the points on the x-axis.
Parameter 2 is an array containing the points on the y-axis.
If we need to plot a line from (1, 3) to (8, 10), we have to pass two arrays [1, 8] and [3, 10] to the plot function.

Example : Draw a line in a diagram from position (1, 3) to position (8, 10):
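
A minimal sketch of this example. The variable names are illustrative; in a notebook such as Colab the figure renders when the cell finishes, while in a script you would call plt.show() at the end:

```python
import numpy as np
import matplotlib.pyplot as plt

xpoints = np.array([1, 8])    # x-coordinates of the two points
ypoints = np.array([3, 10])   # y-coordinates of the two points

# plot() returns a list with one Line2D object per plotted line.
lines = plt.plot(xpoints, ypoints)
```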

Plotting Without Line

► To plot only the markers, you can use shortcut string notation parameter 'o', which means 'rings'.
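
As a sketch, the same two points drawn as markers only (call plt.show() in a script to display the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

xpoints = np.array([1, 8])
ypoints = np.array([3, 10])

# 'o' draws a circle at each point and no connecting line.
lines = plt.plot(xpoints, ypoints, 'o')
```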

Multiple Points

► You can plot as many points as you like; just make sure you have the same number of points on both axes.
► Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and finally to position (8, 10):
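
A short sketch of the four-point line described above (call plt.show() in a script to display it):

```python
import numpy as np
import matplotlib.pyplot as plt

# Both arrays must have the same length: one x and one y per point.
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

lines = plt.plot(xpoints, ypoints)
```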

Default X-Points

► If we do not specify the points on the x-axis, they get the default values 0, 1, 2, 3, etc., depending on
the length of the y-points. So, if we take the same example as above and leave out the x-points, the
diagram will look like this: Plotting without x-points:
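
A minimal sketch, assuming the same y-values as in the previous example:

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# Only y-values are given, so the x-values default to 0, 1, 2, 3.
lines = plt.plot(ypoints)
default_x = list(lines[0].get_xdata())
```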

Matplotlib Markers

► You can use the keyword argument marker to emphasize each point with a specified marker:
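
As a sketch, marking each point with a circle while keeping the connecting line (the data values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# marker='o' adds a circle at every point; the line is still drawn.
lines = plt.plot(ypoints, marker='o')
```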

Marker Reference

Format Strings fmt

► You can also use the shortcut string notation parameter to specify the marker.
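
A hedged sketch of a format string that combines a marker, a line style, and a color. The string 'o:r' is one illustrative combination (circle markers, dotted line, red):

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# 'o:r' = circle marker, dotted line, red color.
lines = plt.plot(ypoints, 'o:r')
```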

Marker Size

► You can use the keyword argument markersize, or the shorter version, ms, to set the size of the
markers:
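
A minimal sketch with an illustrative marker size of 20:

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# ms is the short form of markersize.
lines = plt.plot(ypoints, marker='o', ms=20)
```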

Marker Color

You can use the keyword argument markeredgecolor, or the shorter mec, to set the
color of the edge of the markers:

You can use the keyword argument markerfacecolor, or the shorter mfc, to set the
color inside the edge of the markers:

Marker Color (cont'd)

► Use both the mec and mfc arguments to color the entire marker:
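
As a sketch, setting the edge and face to the same color (red is an arbitrary choice here):

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# mec colors the marker edge, mfc colors the marker interior.
lines = plt.plot(ypoints, marker='o', ms=20, mec='r', mfc='r')
```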

Matplotlib Line

► Linestyle : You can use the keyword argument linestyle, or shorter ls, to change the style of the
plotted line:
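
A minimal sketch using the 'dotted' style; other named styles include 'dashed' and 'dashdot':

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# linestyle='dotted' can also be written as ls=':'.
lines = plt.plot(ypoints, linestyle='dotted')
```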

Shorter Syntax

Line Color

► You can use the keyword argument color or the shorter c to set the color of the line:

Line Width

► You can use the keyword argument linewidth or the shorter lw to change the width of the line.
► The value is a floating-point number, in points:
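
A short sketch that sets a line width together with a line color; the values 20.5 and 'r' are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

ypoints = np.array([3, 8, 1, 10])

# color (or c) sets the line color; linewidth (or lw) sets the width in points.
lines = plt.plot(ypoints, color='r', linewidth=20.5)
```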

Multiple Lines

► You can plot as many lines as you like by simply adding more plt.plot() calls:

Draw two lines by specifying
the x- and y-point values for both lines:
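
A minimal sketch with two made-up lines passed to a single plot() call as x, y pairs:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data for two lines.
x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])

# Each x, y pair becomes one line on the same axes.
lines = plt.plot(x1, y1, x2, y2)
```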

Matplotlib Subplot

► Display Multiple Plots
► With the subplot() function you can draw multiple plots in one figure:

The subplot() function takes three arguments that describe the layout of the figure.

The layout is organized in rows and columns, which are represented by the first and
second arguments.

The third argument represents the index of the current plot.
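
The three arguments can be sketched with two plots side by side (1 row, 2 columns). The data values are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3])
y_left = np.array([3, 8, 1, 10])
y_right = np.array([10, 20, 30, 40])

fig = plt.figure()   # start a fresh figure

# 1 row, 2 columns, first plot.
plt.subplot(1, 2, 1)
plt.plot(x, y_left)

# 1 row, 2 columns, second plot.
plt.subplot(1, 2, 2)
plt.plot(x, y_right)
```

Changing the call to subplot(2, 1, 1) and subplot(2, 1, 2) would stack the two plots vertically instead.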

Draw 6 plots: You can draw as many plots as you like on
one figure; just describe the number of rows, columns, and the index
of the plot.

Matplotlib Scatter

► Creating Scatter Plots


► With Pyplot, you can use the scatter() function to draw a scatter plot.
► The scatter() function plots one dot for each observation. It needs two arrays of the same length, one for
the values of the x-axis, and one for values on the y-axis:
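
A minimal sketch of scatter(), assuming illustrative observations of car age (x) and car speed (y); the numbers are example values, not data taken from the slides:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative observations: car age in years and measured speed.
age = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
speed = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

# One dot per (age, speed) pair; both arrays must have the same length.
points = plt.scatter(age, speed)
```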

Compare Plots

► In the example above, there seems to be a relationship between speed and age, but what if we plot the
observations from another day as well? Will the scatter plot tell us something else?

Color Each Dot

► You can even set a specific color for each dot by using an array of colors as value for the c argument:
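
As a sketch with four made-up points and one color name per point:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([5, 7, 8, 2])
y = np.array([99, 86, 87, 111])
colors = ["red", "green", "blue", "yellow"]   # one color per dot

points = plt.scatter(x, y, c=colors)
```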

Matplotlib Bars

Creating Bars
With Pyplot, you can use the bar() function to draw bar graphs:
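
A minimal sketch of bar() with four hypothetical categories:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative categories and values.
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

# bar() returns a container holding one rectangle per bar.
bars = plt.bar(x, y)
heights = [bar.get_height() for bar in bars]
```

Using plt.barh(x, y) instead draws the bars horizontally.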

Bar Color
The bar() and barh() functions take the keyword argument color to set the color of the bars:

Bar Width

► The bar() function takes the keyword argument width to set the width of the bars:
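
A short sketch that sets a bar width together with a bar color; the values 0.1 and "red" are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

# Thin red bars.
bars = plt.bar(x, y, color="red", width=0.1)
```

For horizontal bars drawn with barh(), the corresponding keyword argument is height rather than width.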

Creating Pie Charts

► With Pyplot, you can use the pie() function to draw pie charts:
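
A minimal sketch of pie() with four illustrative values. Each value becomes one wedge whose size is proportional to its share of the total:

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([35, 25, 25, 15])

fig = plt.figure()   # start a fresh figure
plt.pie(y)

# The wedges are stored as patches on the current axes.
wedge_count = len(fig.gca().patches)
```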

Labels: Pie Charts

► Add labels to the pie chart with the labels parameter.
► The labels parameter must be an array with one label for
each wedge:
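
As a sketch, with hypothetical fruit names as the wedge labels:

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([35, 25, 25, 15])
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]   # one label per wedge

fig = plt.figure()   # start a fresh figure
plt.pie(y, labels=mylabels)

# The labels are drawn as text objects on the current axes.
label_texts = [text.get_text() for text in fig.gca().texts]
```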

Start Angle

► By default the first wedge starts at the x-axis, but you can change the start angle by specifying a
startangle parameter.
► The startangle parameter is an angle in degrees; the default angle is 0:
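
A minimal sketch that starts the first wedge at 90 degrees instead of the default 0:

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([35, 25, 25, 15])

fig = plt.figure()   # start a fresh figure
plt.pie(y, startangle=90)

# theta1 is the angle (in degrees) at which a wedge begins.
first_wedge = fig.gca().patches[0]
print(first_wedge.theta1)
```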

OMNeT++ Analyzer

Result Analysis with Python

https://docs.omnetpp.org/tutorials/pandas/
The CSV file has a fixed number of columns named run, type, module, name, value, etc. Each result item
(scalar, statistic, histogram, or vector) produces one row of output in the CSV.
Other items, such as run attributes, iteration variables of the parameter study, and result attributes, also
generate their own rows.
The content of the type column determines what type of information a given row contains. The type column
also determines which other columns are in use.
For example, the binedges and binvalues columns are only filled in for histogram items. The columns are:
Data frame from Aloha
Aloha scenarios

https://docs.omnetpp.org/tutorials/pandas/
Results
Results
Aloha - folder

References

► https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb#scrollTo=NO_4WhdJZ5Pm
► https://www.w3schools.com/python/pandas/pandas_dataframes.asp
► https://www.w3schools.com/python/pandas/pandas_cleaning.asp
► https://www.w3schools.com/python/pandas/pandas_analyzing.asp
► Matplotlib Tutorial (w3schools.com)
► https://omnetpp-wasm-demo.web.app/aloha/index.html
► https://drive.google.com/file/d/1uFThxdD6IerYnftbuupmup5NPOnXXIm9/view?usp=sharing
► https://colab.research.google.com/drive/1WJ3H9DeJcXvT9Sz5Qbut20YAh9jfdrVC?usp=sharing
► https://docs.omnetpp.org/tutorials/pandas/

…questions, comments, etc. are welcome…
