Starting Out with Pandas

CMPG 111 - Introduction to Computing and Programming

CMPG 111 SU8 1


I. Introduction to Data Analysis with Pandas
Pandas is a powerful Python library widely used for data manipulation and
analysis. It provides high-level data structures and functions designed to
make working with structured data fast, easy, and expressive. At the core of
Pandas lies the DataFrame, a two-dimensional labeled data structure
resembling a table or spreadsheet, and it offers efficient tools to perform
various data operations effectively.

A. What is Pandas?
Pandas is an open-source library built on top of NumPy, another popular
library for numerical computing. Wes McKinney began developing it in 2008,
and it was released as open source in 2009. The name “Pandas” refers to both
“Panel Data” and “Python Data Analysis”.
Pandas is well-suited for a wide range of data-related tasks, including data
cleaning, preparation, exploration, and analysis. It primarily revolves around
the DataFrame (a type of data structure). A DataFrame is a two-dimensional
labeled data structure with columns of potentially different types. It is like
an Excel spreadsheet, where data is organised into rows and columns, each
with its own label or index.

B. Why use Pandas for Data Analysis?


Pandas offers several advantages that make it a preferred choice for data
analysis tasks:

• Ease of use: Pandas provides simple syntax, making it accessible to users
of all levels, from beginners to advanced programmers.

• Efficiency: Pandas is built for speed and performance, enabling fast data
processing even with large datasets.

• Rich functionality: Pandas offers a vast array of functions and methods for
data manipulation, cleaning, aggregation, and analysis.

• Integration: Pandas seamlessly integrates with other Python libraries and
tools commonly used in data analysis, such as NumPy, Matplotlib, and
Scikit-learn.

Pandas is suitable for a wide range of data analysis tasks across various
domains, including:

• Exploratory data analysis (EDA): Pandas facilitates quick and efficient
exploration of datasets, helping analysts gain insight into the underlying
patterns and relationships in the data.

• Data preprocessing: Pandas provides tools for cleaning and transforming
data, handling missing values, and preparing data for modelling and
visualisation.

• Data aggregation and summarisation: Pandas allows for grouping,
aggregating, and summarising data based on different criteria, enabling
the creation of informative summaries and reports.

• Data visualisation: Although Pandas itself is not primarily a visualisation
library, it integrates seamlessly with visualisation tools like Matplotlib and
Seaborn to create insightful plots and charts.

C. DataFrame and its relation to Dictionaries.


A DataFrame in Pandas can be thought of as a tabular representation of data,
similar to a spreadsheet. Each column of a DataFrame is a Series (a labeled,
list-like sequence of values), while each row represents an individual
observation or record.
Conceptually, a DataFrame can be compared to a dictionary of list/Series
objects. In fact, you can create a DataFrame from a dictionary, where the
keys become the column names and the values become the data within each
column.
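The dictionary comparison can be checked with a short sketch. The data and the names marks and marks_df are made up for illustration only:

```python
import pandas as pd

# Hypothetical sample data: keys become column names, lists become column values
marks = {"Student": ["Thabo", "Lerato", "Sipho"],
         "Mark": [72, 85, 64]}

marks_df = pd.DataFrame(marks)

# Each column of the DataFrame is a pandas Series
print(type(marks_df["Mark"]))

# The column names match the dictionary keys
print(list(marks_df.columns))

# Converting back gives a dictionary of lists again
print(marks_df.to_dict(orient="list"))
```

Converting back with to_dict(orient="list") shows that the table and the dictionary hold the same information in two different shapes.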

II. Setting Up Pandas
Before we can start working with Pandas, we need to ensure that it is
installed in our Python environment. Additionally, we’ll install Matplotlib for
data visualisation purposes. While there are various ways to install Python
packages, we’ll cover the installation process using command prompt (cmd)
for Windows and the terminal for macOS. Also, Spyder IDE is recommended
as an alternative due to its helpful features for working with data.

A. Installing Pandas and Matplotlib


To install Pandas and Matplotlib, follow these steps:
1. Open Command Prompt (Windows) or Terminal (macOS):

• On Windows, you can open the Command Prompt by searching for “cmd” in the
Start menu.

• On macOS, you can open the Terminal by searching for it in Spotlight or
navigating to Applications > Utilities > Terminal.

2. Install Pandas and Matplotlib using pip:

• In the Command Prompt or Terminal, type the following commands:

pip3 install pandas
pip3 install matplotlib

This will download and install the latest versions of Pandas and
Matplotlib for Python 3 from the Python Package Index (PyPI).

B. Importing Pandas
To import Pandas in your Python code and use the conventional alias pd,
include the following line at the beginning of your script:
import pandas as pd

This allows you to reference Pandas functions and objects using the
shorthand pd, which is a common practice among Python developers for
brevity and readability.
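A quick way to confirm that the installation and the import both work is to print the installed version. This is only a small sanity check; the version number you see depends on your own installation:

```python
import pandas as pd

# If this runs without an ImportError, Pandas is installed correctly
print(pd.__version__)
```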

III. Understanding DataFrames in Pandas
At the core of Pandas lies the DataFrame, a two-dimensional labeled data
structure resembling a table or spreadsheet. Understanding DataFrames is
essential for effectively working with data in Pandas, as they serve as the
primary data structure for most data analysis tasks.

A. What is a DataFrame?
A DataFrame (df) is a two-dimensional labeled data structure with columns of
potentially different types. It is similar to a spreadsheet or SQL table, where
data is organised into rows and columns, each with its own label or index.

1. Components of a DataFrame:
• Rows: Each row in a DataFrame represents an individual observation or
record. Rows are identified by their index labels, which are typically
integers or strings.

• Columns: Each column in a DataFrame represents a particular variable or
feature. Columns have unique labels, often referred to as column names.

2. Characteristics of DataFrames:
• Tabular structure: DataFrames are organised in a tabular format, making
them easy to visualise and work with.

• Heterogeneous data: DataFrames can contain columns of different data
types, allowing for flexible handling of diverse datasets.

• Labeled axes: DataFrames have both row and column labels, enabling
efficient indexing and selection of data.
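The components and characteristics listed above can be inspected directly on a DataFrame. The following is a minimal sketch with made-up data; the name people and the row labels "a", "b", "c" are assumptions for illustration:

```python
import pandas as pd

# Hypothetical table with one text column and one numeric column
people = pd.DataFrame({"Name": ["John", "Emily", "Jack"],
                       "Age": [25, 30, 35]},
                      index=["a", "b", "c"])  # custom row labels

print(people.shape)          # (number of rows, number of columns)
print(list(people.index))    # row labels
print(list(people.columns))  # column labels
print(people.dtypes)         # data type of each column
```

The dtypes output shows the heterogeneous nature of the table: the two columns are stored with different data types.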

B. Creating DataFrames

Pandas provides various methods for creating DataFrames from different
data sources:

1. From dictionaries:

import pandas as pd

data = {'Name': ['John', 'Emily', 'Jack', 'Sophia'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}

df = pd.DataFrame(data)

This creates a DataFrame with three columns (Name, Age, City) and four
rows/records.

2. From external data sources:

Pandas can also read data from external sources such as CSV files, Excel
files, SQL databases, and more. For example:

df = pd.read_csv('data.csv')

This reads data from a CSV file named ‘data.csv’ and creates a DataFrame.
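To try read_csv() without having a file on disk, the CSV text can be supplied through io.StringIO, which behaves like an open file. The CSV content below is invented for illustration and is not the course's data.csv:

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = """Name,Age,City
John,25,New York
Emily,30,Los Angeles
"""

cities_df = pd.read_csv(io.StringIO(csv_text))
print(cities_df)
```

The first line of the CSV text becomes the column names, and every following line becomes one row.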

C. Indexing and Selecting Data in DataFrames
DataFrames support various methods for indexing and selecting data,
allowing you to extract subsets of data based on specific criteria:

1. Using the .loc[] Indexer:

# Selecting a single row by label
df.loc['Row_Label']

# Selecting multiple rows by label
df.loc[['Row_Label1', 'Row_Label2']]

# Selecting a single element by row and column label
df.loc['Row_Label', 'Column_Label']

# Selecting multiple rows and columns by label
df.loc[['Row_Label1', 'Row_Label2'], ['Column1', 'Column2']]

# Selecting rows by boolean condition and columns by label
df.loc[df['Column'] > threshold, 'Column']

The .loc[] indexer is used for label-based indexing, allowing you to select
rows and columns based on their labels. The basic syntax for selecting data
with .loc[] is df.loc[rows, columns], where rows and columns can be single
labels, lists of labels, slices, or boolean arrays.
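The placeholder patterns above can be tried on a small table. This is a sketch with invented data; the name scores and the row labels are assumptions, not part of the course files:

```python
import pandas as pd

# Hypothetical data with string row labels
scores = pd.DataFrame({"Maths": [55, 78, 91], "Physics": [62, 70, 88]},
                      index=["Anna", "Ben", "Cara"])

one_row = scores.loc["Ben"]                         # a Series holding one row
one_cell = scores.loc["Ben", "Maths"]               # a single value
subset = scores.loc[["Anna", "Cara"], ["Maths"]]    # a smaller DataFrame
high = scores.loc[scores["Maths"] > 60, "Physics"]  # boolean rows, one column

print(one_cell)
print(subset)
print(high)
```

Note that a single row label returns a Series, while lists of labels return a DataFrame, even when the list has only one entry.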

The first example below, Ex01, uses the .loc[] indexer with a boolean
condition to select rows from a DataFrame. The expression selects rows where
the value in the ‘Gender’ column is equal to ‘male’.
In the second example, Ex02, we again use .loc[] and filter the rows using
conditions on multiple columns. Conditions are combined with the logical
operators & (and) and | (or), and each condition must be wrapped in
parentheses. The expression filters rows where the ‘Age’ column is greater
than 30 and the ‘Gender’ column is ‘male’.

# Ex01: Selecting rows where the 'Gender' column is
# equal to 'male' using .loc[]
df.loc[df['Gender'] == 'male']

# Ex02: Selecting rows where 'Age' is greater than 30 and
# 'Gender' is 'male' using .loc[]
df.loc[(df['Age'] > 30) & (df['Gender'] == 'male')]

# Ex03: Selecting a range of rows (from index 0 to 5) and specific
# columns ('Gender' and 'Age') using .loc[]
df.loc[0:5, ['Gender', 'Age']]

In the last example, Ex03, we use .loc[] to select a range of rows (index
labels 0 to 5) with a slice, and specific columns (‘Gender’ and ‘Age’) by
label. Unlike ordinary Python slices, a .loc[] slice includes its end label,
so 0:5 returns six rows.
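Ex01 to Ex03 can be run against a small table to see how many rows each one returns. The data below is invented for illustration; only the column names ‘Gender’ and ‘Age’ come from the examples:

```python
import pandas as pd

# Hypothetical data using the column names from Ex01 to Ex03
df = pd.DataFrame({
    "Gender": ["male", "female", "male", "male", "female", "male", "female"],
    "Age": [28, 34, 41, 19, 52, 37, 30],
})

males = df.loc[df["Gender"] == "male"]                             # Ex01
older_males = df.loc[(df["Age"] > 30) & (df["Gender"] == "male")]  # Ex02
first_rows = df.loc[0:5, ["Gender", "Age"]]                        # Ex03

print(len(males))        # 4
print(len(older_males))  # 2
print(len(first_rows))   # 6, because labels 0 and 5 are both included
```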

2. Using the .iloc[] Indexer:

# Selecting a single row by integer position
df.iloc[0]

# Selecting multiple rows by integer position
df.iloc[[0, 1, 2]]

# Selecting a single element by integer position
df.iloc[0, 0]

# Selecting a range of rows and columns by integer position
df.iloc[0:3, 1:4]

# Selecting rows by boolean condition and columns by integer position
# (.iloc[] needs a plain boolean array, so the boolean Series is converted first)
df.iloc[(df['Column'] > threshold).to_numpy(), 0:2]

The .iloc[] indexer is used for integer-location based indexing, allowing you
to select rows and columns based on their integer positions. The basic
syntax for selecting data with .iloc[] is df.iloc[rows, columns], where rows
and columns can be single integers, lists of integers, slices, or boolean
arrays.

In both examples below, the .iloc[] indexer is used for integer-location based
indexing for both rows and columns. Ex04 selects the element at the first
row and first column (index 0 for both), whereas Ex05 selects a range of rows
(from index 0 to 3) and a range of columns (from index 2 to 4). As with
ordinary Python slices, the end position of an .iloc[] slice is excluded.

# Ex04: Selecting a specific element at row index 0 and column index 0
df.iloc[0, 0]

# Ex05: Selecting a range of rows (from index 0 to 3) and
# columns (from index 2 to 4)
df.iloc[0:4, 2:5]
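The difference between position-based and label-based slicing is easy to miss, so here is a small sketch that compares the two. The table grid and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical 5x5 table of integers with default row labels 0..4
grid = pd.DataFrame([[r * 10 + c for c in range(5)] for r in range(5)],
                    columns=["a", "b", "c", "d", "e"])

top_left = grid.iloc[0, 0]         # element in the first row and first column
block = grid.iloc[0:4, 2:5]        # rows 0-3, columns 2-4 (end positions excluded)
by_label = grid.loc[0:4, "c":"e"]  # .loc[] includes both end labels

# .iloc[] accepts a boolean NumPy array, but not a boolean Series
mask = (grid["a"] >= 20).to_numpy()
filtered = grid.iloc[mask, 0:2]

print(block.shape)     # (4, 3)
print(by_label.shape)  # (5, 3)
print(filtered.shape)  # (3, 2)
```

The same slice 0:4 returns four rows with .iloc[] and five rows with .loc[], which is why it helps to be explicit about which indexer is in use.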

D. Summary
Understanding how to index and select data from DataFrames using .loc[]
and .iloc[] is important for data manipulation and analysis in Pandas. By
mastering these indexing and selecting methods, you can efficiently extract
subsets of data for further processing and exploration.

IV. Working with DataFrames
Now that we have a basic understanding of DataFrames (DFs) in Pandas and
how to create them, let's delve deeper into working with them. This section
covers various operations and techniques for manipulating, analysing, and
visualising data within DFs.

A. Basic Operations
1. Viewing DF Information:

# Display the first 10 rows of the DataFrame
print(df.head(10))

# Display the last 5 rows of the DataFrame
print(df.tail(5))

# Display basic information about the DataFrame
# (info() prints its report itself and returns None, so print() is not needed)
df.info()

# Display summary statistics of numerical columns
print(df.describe())

These operations allow you to quickly inspect the structure and contents of
the DataFrame, as well as obtain summary statistics for numerical columns.
When using the head() and tail() methods, you can pass an integer as an
argument to specify the number of rows to display. For example, df.head(10)
displays the first 10 rows of the DataFrame. If no argument is passed, 5 rows
are returned by default.
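The effect of the argument to head() and tail() can be seen on a table with a known number of rows. The 20-row DataFrame below is a made-up example:

```python
import pandas as pd

# Hypothetical DataFrame with 20 rows holding the values 0 to 19
numbers = pd.DataFrame({"n": range(20)})

print(len(numbers.head()))            # 5, the default
print(len(numbers.head(10)))          # 10
print(numbers.tail(3)["n"].tolist())  # [17, 18, 19]
```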

2. Adding and Removing Columns:

# Adding a new column to the DataFrame
df['New Column'] = values

# Removing a column from the DataFrame
df.drop(columns=['Column'], inplace=True)

These operations enable you to add new columns to the DataFrame or
remove existing columns as needed.
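A new column is often computed from existing ones rather than typed in by hand. The sketch below uses invented data and column names to show this, and also shows what drop() does when inplace=True is left out:

```python
import pandas as pd

# Hypothetical order data
items = pd.DataFrame({"Price": [10.0, 20.0, 30.0], "Quantity": [3, 1, 2]})

# New column computed row by row from two existing columns
items["Total"] = items["Price"] * items["Quantity"]

# Without inplace=True, drop() returns a new DataFrame and leaves items unchanged
without_quantity = items.drop(columns=["Quantity"])

print(list(items.columns))
print(list(without_quantity.columns))
```

Assigning the result of drop() to a new name keeps the original table available, which can be safer while experimenting.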

B. Data Manipulation
1. Filtering Data:

# Filtering rows based on a condition using DFs
filtered_df = df[df['Column'] > threshold]

This operation allows you to select rows from the DataFrame that meet a
specific condition. In this example, the statement filters rows from the
DataFrame (df) where the values in the ‘Column’ column are greater than
threshold.
NOTE: This method of filtering directly using df without .loc[] or .iloc[]
achieves the same result as the previously covered methods but in a more
concise and direct manner (see code snippet below). While .loc[] and .iloc[]
provide more explicit ways of indexing and selecting data from DataFrames,
using only df for filtering based on conditions offers a more straightforward
approach, especially for simple filtering tasks.

# Filtering rows based on a condition using df.loc[]
filtered_df_loc = df.loc[df['Column'] > threshold]
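That the two filtering styles give the same result can be checked with the equals() method. The values and the threshold below are invented for illustration:

```python
import pandas as pd

# Hypothetical data and threshold
df = pd.DataFrame({"Column": [5, 12, 8, 20]})
threshold = 10

filtered_df = df[df["Column"] > threshold]
filtered_df_loc = df.loc[df["Column"] > threshold]

print(filtered_df.equals(filtered_df_loc))  # True
print(filtered_df["Column"].tolist())       # [12, 20]
```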

2. Sorting Data:

# Sorting DF by a single column
sorted_df = df.sort_values(by='Column', ascending=True)

Sorting the DataFrame allows you to arrange the data in a specific order
based on the values of one or more columns.
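Sorting by more than one column works by passing lists to by and ascending. The following is a minimal sketch with made-up data, sorting by age from oldest to youngest and then alphabetically by name within each age:

```python
import pandas as pd

# Hypothetical data with repeated ages
people = pd.DataFrame({"Name": ["Zoe", "Adam", "Musa", "Lindi"],
                       "Age": [30, 25, 30, 25]})

# Age descending first, then Name ascending to break ties
by_age_then_name = people.sort_values(by=["Age", "Name"],
                                      ascending=[False, True])

print(by_age_then_name["Name"].tolist())  # ['Musa', 'Zoe', 'Adam', 'Lindi']
```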

C. Data Aggregation and Grouping
1. Grouping Data:

# Grouping DataFrame by a column
# (groupby() returns a GroupBy object, not a new DataFrame)
grouped_df = df.groupby('Column')

2. Aggregating Data:

# Applying aggregation functions to grouped data
aggregated_df = grouped_df.agg({'Column2': 'mean',
                                'Column3': 'sum'})
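To make the grouping and aggregation steps concrete, here is a small sketch on invented sales data. The column names Region, Price and Units are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical sales data with two regions
sales = pd.DataFrame({"Region": ["North", "South", "North", "South"],
                      "Price": [10.0, 20.0, 30.0, 40.0],
                      "Units": [1, 2, 3, 4]})

# One row per region: average price and total units
summary = sales.groupby("Region").agg({"Price": "mean", "Units": "sum"})

print(summary)
```

The group labels (North and South) become the row index of the result, so individual values can be read with summary.loc['North', 'Price'].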

D. Data Visualisation
Please refer to SU6, Chapter 7 for more information on Matplotlib.

1. Basic Plotting:

# Plotting a line chart
df.plot(x='Column1', y='Column2', kind='line')

# Plotting a bar chart
df.plot(x='Category', y='Values', kind='bar')

These operations enable you to visualise the data within the DataFrame using
different types of plots.

2. Customising Plots:

# plt refers to Matplotlib's pyplot module
import matplotlib.pyplot as plt

# Changing plot style and colors
df.plot(style='--', color='red')

# Adding titles and labels to the plot (after the plot has been created)
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')

# Displaying the plot
plt.show()

Customising plots allows you to enhance the appearance and readability of
the visualisations.
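The plotting and customising steps can be combined in one short script. This is a sketch only: the rainfall figures are invented, and the "Agg" backend is selected so that the script also runs on a machine without a display (in Spyder or a normal desktop session that line can be left out):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly rainfall data
rain = pd.DataFrame({"Month": ["Jan", "Feb", "Mar"],
                     "Rainfall": [120, 95, 60]})

# df.plot() returns a Matplotlib Axes object that can be customised directly
ax = rain.plot(x="Month", y="Rainfall", kind="bar", color="red")
ax.set_title("Monthly rainfall")
ax.set_xlabel("Month")
ax.set_ylabel("Rainfall (mm)")

# In an interactive session, plt.show() would now display the chart
```

Working with the returned Axes object (ax) is an alternative to calling plt.title() and similar functions, and it makes it explicit which plot is being changed.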

E. Summary
This section provides an overview of various operations and techniques for
working with DFs in Pandas. By mastering these techniques, you'll be
equipped to efficiently manipulate, analyse, and visualise data within them,
enabling you to gain valuable insights from your datasets.

V. Loading and Saving Data
Pandas provides convenient functions for loading data from various file
formats and saving DataFrame objects to disk. This section covers loading
data from external sources into Pandas DFs and saving DFs to files for future
use.

A. Loading Data into Pandas


1. From CSV files:
df = pd.read_csv('data.csv')

2. From Excel files:
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

3. From JSON files:
df = pd.read_json('data.json')

B. Saving Data from Pandas

1. To CSV file:
df.to_csv('output.csv', index=False)

2. To Excel file:
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)

3. To JSON file:
df.to_json('output.json', orient='records')
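A save-and-reload round trip is a simple way to check that the functions above behave as expected. The sketch below writes to a temporary folder so that no files are left behind; the data and variable names are invented for illustration. Excel is left out here because read_excel() and to_excel() rely on an additional package (such as openpyxl) being installed:

```python
import os
import tempfile
import pandas as pd

# Hypothetical data to save and load again
original = pd.DataFrame({"Name": ["John", "Emily"], "Age": [25, 30]})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "output.csv")
    json_path = os.path.join(folder, "output.json")

    # index=False leaves the row labels out of the CSV file
    original.to_csv(csv_path, index=False)
    # orient="records" writes one JSON object per row
    original.to_json(json_path, orient="records")

    from_csv = pd.read_csv(csv_path)
    from_json = pd.read_json(json_path)

print(from_csv)
print(from_json)
```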

C. Summary
Loading and saving data is a fundamental aspect of working with DFs in
Pandas. By mastering these techniques, you'll be able to efficiently import
data into Pandas for analysis and visualisation, as well as export processed
data for sharing or future use.

VI. Integrated Practical Example
A. Reading Files in with Pandas
Create a new script called pandas_demo.py.
Download the “data.zip” folder and unzip it. You will find a file called
“country_data.csv”; put this file in the same folder as your Python script.
Now, input the following code in the script and run it:

import pandas
data_frame = pandas.read_csv("country_data.csv")
print(data_frame)

You should see the following output in the console:

    Age  Gender       Country
0    39       M  South Africa
1    25       M      Botswana
2    29       F  South Africa
3    46       M  South Africa
4    22       F         Kenya
5    35       F    Mozambique
6    22       F       Lesotho
7    49       M         Kenya
8    30       M         Kenya
9    40       F         Egypt
10   30       M         Sudan

Note: If you are using Spyder, you can click on the “variable explorer” above
the console to see a variable called “data_frame”. If you double-click on
“data_frame” you should see your data in a spreadsheet format.

And that is it: with three (3) lines of code you have read a CSV file into
Python that you would normally open in Excel.
Now, let’s add two useful lines of code.

The first is:

# info() prints its report directly, so no print() is needed
data_frame.info()

Console output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Age      11 non-null     int64
 1   Gender   11 non-null     object
 2   Country  11 non-null     object
dtypes: int64(1), object(2)
memory usage: 392.0+ bytes

The info() method is useful for summarising a table. The information provided
includes the number of columns, their names, the non-null count of each
column, and its Dtype (data type).

The second useful method of Pandas is the following:

print(data_frame.describe())

Console output:

             Age
count  11.000000
mean   33.363636
std     9.233339
min    22.000000
25%    27.000000
50%    30.000000
75%    39.500000
max    49.000000

The describe() method produces descriptive statistics for either a DataFrame
or a Series. It offers an overview of the central tendency, spread, and
distribution shape of a dataset. Its main purpose is to summarise numerical
data, presenting statistics that hold significance for quantitative
variables.
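The individual numbers in the describe() output can also be computed one at a time, which helps to see what each row means. The sketch below re-enters the eleven ages from the printed table as a Series; the name ages is chosen for illustration:

```python
import pandas as pd

# The Age values from the country data shown earlier
ages = pd.Series([39, 25, 29, 46, 22, 35, 22, 49, 30, 40, 30], name="Age")

print(ages.count())                       # number of values
print(ages.mean())                        # average
print(ages.std())                         # sample standard deviation
print(ages.min(), ages.max())             # smallest and largest value
print(ages.quantile([0.25, 0.50, 0.75]))  # the 25%, 50% and 75% rows
```

The 50% value is the median: half of the ages lie below it and half above it.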

B. Exploring Data and Manipulating Columns in Pandas
Let’s look at the same country data example:

import pandas as pd
df = pd.read_csv("country_data.csv")

Note: import pandas as pd and import pandas are essentially the same
thing. The pd is just a shorthand or shortcut way of using the pandas
module.

Column-Based Access:

# Accessing specific columns
# (column names are case-sensitive and must match the CSV header)
print(df['Age'])
print(df['Gender'])

Descriptive Statistics:

# Basic statistics
print(df['Age'].min())
print(df['Age'].max())
print(df['Age'].mean())

Filtering and Slicing:

# Filtering data
print(df[df['Age'] > 30])

# Slicing rows
print(df[1:4])  # Select rows 1 to 3 and all columns

Adding and Removing a Column:

# Adding a new column (one value for each of the 11 rows)
df['new_column'] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(df)

# Removing a column
df.drop(columns=['new_column'], inplace=True)
print(df)
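If the CSV file is not at hand, the same steps can be tried with the data typed in directly. The sketch below assumes that country_data.csv is a comma-separated file with the header Age,Gender,Country; the rows are copied from the console output shown earlier:

```python
import io
import pandas as pd

# Assumed layout of country_data.csv, rebuilt from the printed table
csv_text = """Age,Gender,Country
39,M,South Africa
25,M,Botswana
29,F,South Africa
46,M,South Africa
22,F,Kenya
35,F,Mozambique
22,F,Lesotho
49,M,Kenya
30,M,Kenya
40,F,Egypt
30,M,Sudan
"""
df = pd.read_csv(io.StringIO(csv_text))

over_30 = df[df["Age"] > 30]  # filtering
rows_1_to_3 = df[1:4]         # slicing: end position 4 is excluded

print(len(over_30))                 # 5
print(rows_1_to_3["Age"].tolist())  # [25, 29, 46]
```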

Well done if you have gotten this far. Remember to try the above examples
out yourself!

References
• Builtin. (n.d.). Pandas iloc: How to Use iloc in Pandas for Data Science.
Retrieved from
https://builtin.com/software-engineering-perspectives/pandas-iloc
• McKinney, W. (2018). Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython. O'Reilly Media.
• Barsch, B. (2024). Coding Summer School (CSS). Organised by the Centre
for High Performance Computing (CHPC) of the CSIR and the National
Institute for Theoretical Computational Science (NITheCS). South Africa.
• Pandas Documentation. (n.d.). Installation. Retrieved from
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
• Pandas Documentation. (n.d.). IO tools (text, CSV, HDF5, …). Retrieved from
https://pandas.pydata.org/pandas-docs/stable/reference/io.html
• Pandas Documentation. (n.d.). Pandas API Documentation. Retrieved from
https://pandas.pydata.org/docs/
• VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for
Working with Data. O'Reilly Media.
