Starting Out With Pandas - Ext
A. What is Pandas?
Pandas is an open-source library built on top of NumPy, another popular
library for numerical computing. It was developed by Wes McKinney and
initially released in 2008. The name “Pandas” refers to both “Panel
Data” and “Python Data Analysis”.
Pandas is well-suited for a wide range of data-related tasks, including data
cleaning, preparation, exploration, and analysis. It revolves primarily around
the DataFrame, a two-dimensional labeled data structure with columns of
potentially different types. A DataFrame is like an Excel spreadsheet: data is
organised into rows and columns, each with its own label or index.
• Efficiency: Pandas is built for speed and performance, enabling fast data
processing even with large datasets.
Pandas is suitable for a wide range of data analysis tasks across various
domains, including:
B. Importing Pandas
To import Pandas in your Python code and use the conventional alias pd,
include the following line at the beginning of your script:
import pandas as pd
This allows you to reference Pandas functions and objects using the
shorthand pd, which is a common practice among Python developers for
brevity and readability.
A. What is a DataFrame?
A DataFrame (df) is a two-dimensional labeled data structure with columns of
potentially different types. It is similar to a spreadsheet or SQL table, where
data is organised into rows and columns, each with its own label or index.
1. Components of a DataFrame:
• Rows: Each row in a DataFrame represents an individual observation or
record. Rows are identified by their index labels, which are typically
integers or strings.
2. Characteristics of DataFrames:
• Tabular structure: DataFrames are organised in a tabular format, making
them easy to visualise and work with.
• Labeled axes: DataFrames have both row and column labels, enabling
efficient indexing and selection of data.
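The row and column labels described above can be inspected directly. The sketch below uses a small made-up table; the column names, row labels, and values are illustrative assumptions, not data from this guide.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {"Name": ["Ann", "Ben"], "Age": [28, 35]},
    index=["r1", "r2"],  # string row labels
)

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # each column can hold a different type
print(df.shape)    # (number of rows, number of columns)
```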
B. Creating DataFrames
1. From dictionaries:
import pandas as pd
df = pd.DataFrame(data)
This creates a DataFrame with three columns (Name, Age, City) and four
rows/records.
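The dictionary passed to pd.DataFrame is not shown above, so here is a minimal sketch of what it might look like. Only the column names (Name, Age, City) and the row count come from the text; the values themselves are hypothetical.

```python
import pandas as pd

# Hypothetical values; only the column names and the row count come from the text
data = {
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 32, 41, 29],
    "City": ["Singapore", "London", "Tokyo", "Sydney"],
}

# Dictionary keys become column labels; each list becomes a column's values
df = pd.DataFrame(data)
print(df)
```

All lists in the dictionary must have the same length, otherwise Pandas raises a ValueError.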
df = pd.read_csv('data.csv')
This reads data from a CSV file named ‘data.csv’ and creates a DataFrame.
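To try read_csv without having a file on disk, you can pass it a file-like object instead of a path. The sketch below assumes some made-up CSV text held in a string; it is a stand-in for ‘data.csv’, not its actual contents.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """Name,Age,City
Alice,25,Singapore
Bob,32,London
"""

# read_csv accepts a path or a file-like object; the first line becomes the header
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```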
The .loc[] indexer is used for label-based indexing, allowing you to select
rows and columns based on their labels. The basic syntax for selecting data
with the loc function is:
df.loc[rows, columns], where rows and columns can be
The first example (on the next page), Ex01, uses the .loc[] indexer to select
The expression filters rows where the ‘Age’ column is greater than 30 and the
‘Gender’ column is ‘male’.
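A filter of this kind can be sketched as follows. The DataFrame below is hypothetical (only the ‘Age’ and ‘Gender’ column names come from the text), and the choice of output columns is just for illustration.

```python
import pandas as pd

# Hypothetical data; only the 'Age' and 'Gender' column names come from the text
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dan"],
    "Age": [25, 32, 41, 29],
    "Gender": ["female", "male", "male", "male"],
})

# Combine conditions with & and wrap each one in parentheses
mask = (df["Age"] > 30) & (df["Gender"] == "male")

# Rows where the mask is True, restricted to two columns by label
older_men = df.loc[mask, ["Name", "Age"]]
print(older_men)
```

Note that Python's `and` keyword does not work here; element-wise conditions need `&` (and) or `|` (or).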
In the last example, Ex03, we use .loc[] to select a range of rows (from index
The .iloc[] indexer is used for integer-location based indexing, allowing you
to select rows and columns based on their integer positions. The basic
syntax for selecting data with iloc[] is:
df.iloc[rows, columns], where rows and columns can be
row and first column (index 0 for both), whereas Ex05 selects a range of rows
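Position-based selection can be sketched like this. The data is hypothetical and the examples are not the guide's own Ex04 and Ex05; they only illustrate the same two ideas, plus the one difference from .loc[] that most often causes confusion.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 32, 41, 29],
})

first_cell = df.iloc[0, 0]  # first row, first column (positions start at 0)
first_rows = df.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = df.loc[0:2]      # labels 0, 1 and 2; the end label is included

print(first_cell)
print(first_rows)
print(by_label)
```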
D. Summary
Understanding how to index and select data from DataFrames using .loc[]
mastering these indexing and selecting methods, you can efficiently extract
subsets of data for further processing and exploration.
A. Basic Operations
1. Viewing DF Information:
These operations allow you to quickly inspect the structure and contents of
the DataFrame, as well as obtain summary statistics for numerical columns.
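A minimal sketch of these inspection calls, assuming a small hypothetical DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Age": [22, 27, 30, 39, 49, 35, 28],
    "Country": ["SG", "MY", "SG", "ID", "TH", "SG", "MY"],
})

print(df.head())      # first 5 rows by default
print(df.tail(3))     # last 3 rows
df.info()             # prints the index range, column dtypes and non-null counts
print(df.describe())  # summary statistics for the numeric columns only
```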
When using the head() and tail() functions, you can pass any integer
This operation allows you to select rows from the DataFrame that meet a
specific condition. In this example, the statement filters rows from the
DataFrame (df) where the values in the ‘Column’ column are greater than
threshold.
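The statement described here can be sketched as follows. The column name ‘Column’ and the variable threshold come from the text; the values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical values; 'Column' and threshold are the names used in the text
df = pd.DataFrame({"Column": [5, 12, 8, 20]})
threshold = 10

# The inner comparison builds a True/False Series; df[...] keeps the True rows
filtered = df[df["Column"] > threshold]
print(filtered)
```

The filtered rows keep their original index labels, which is why the result is not renumbered from 0.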
NOTE: This method of filtering directly using df without .loc[] or .iloc[]
achieves the same result as the previously covered methods but in a more
concise and direct manner (see code snippet below). While .loc[] and .iloc[]
provide more explicit ways of indexing and selecting data from DataFrames,
using only df for filtering based on conditions offers a more straightforward
approach, especially for simple filtering tasks.
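As a rough check of this equivalence, the sketch below applies the same condition both ways and compares the results. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"Column": [5, 12, 8, 20], "Other": ["a", "b", "c", "d"]})
threshold = 10

mask = df["Column"] > threshold

direct = df[mask]        # concise form
with_loc = df.loc[mask]  # explicit form

print(direct.equals(with_loc))  # True: both select the same rows
```

The explicit form is still the better fit when you also want to pick columns in the same step, for example df.loc[mask, "Other"].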
2. Sorting Data:
Sorting the DataFrame allows you to arrange the data in a specific order
based on the values of one or more columns.
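Sorting by one column and by several columns might look like this. The data is hypothetical; sort_values is the Pandas method for sorting by column values.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [32, 25, 32, 29],
})

# Ascending order by a single column (the default)
by_age = df.sort_values(by="Age")

# Age descending, then Name ascending to break ties
by_age_then_name = df.sort_values(by=["Age", "Name"], ascending=[False, True])

print(by_age)
print(by_age_then_name)
```

sort_values returns a new DataFrame and leaves df unchanged unless you reassign the result.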
2. Aggregating Data
D. Data Visualisation
Please refer to SU6, Chapter 7 for more information on Matplotlib.
1. Basic Plotting:
These operations enable you to visualise the data within the DataFrame using
different types of plots.
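A minimal plotting sketch, assuming Matplotlib is installed and using made-up data. The "Agg" backend is chosen only so the example runs without a display; in an interactive session you would normally leave that line out and call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs anywhere
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"Year": [2021, 2022, 2023], "Sales": [100, 150, 130]})

# DataFrame.plot returns a Matplotlib Axes object
ax_line = df.plot(x="Year", y="Sales", kind="line")
ax_bar = df.plot(x="Year", y="Sales", kind="bar")
```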
2. Customising Plots:
2. To Excel file:
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
3. To JSON file:
df.to_json('output.json', orient='records')
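To see what orient='records' produces without writing a file, you can call to_json with no path, in which case it returns the JSON as a string. The sketch below uses hypothetical data and reads the string back with read_json.

```python
import io
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 32]})

# With no path, to_json returns a string: one JSON object per row
json_text = df.to_json(orient="records")
print(json_text)

# Read the same text back into a DataFrame
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```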
C. Summary
Loading and saving data is a fundamental aspect of working with DFs in
Pandas. By mastering these techniques, you'll be able to efficiently import
data into Pandas for analysis and visualisation, as well as export processed
data for sharing or future use.
import pandas
data_frame = pandas.read_csv("country_data.csv")
print(data_frame)
Note: If you are using Spyder, you can click on the “variable explorer” above
the console to see a variable called “data_frame”. If you double-click on
“data_frame” you should see your data in a spreadsheet format.
And that is it: with three (3) lines of code, you have read in a CSV file in
Python that you would normally open in Excel.
Now, let’s add two useful lines of code.
print(data_frame.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Age      11 non-null     int64
 1   Gender   11 non-null     object
 2   Country  11 non-null     object
dtypes: int64(1), object(2)
memory usage: 392.0+ bytes
print(data_frame.describe())
Console output:
             Age
count  11.000000
mean   33.363636
std     9.233339
min    22.000000
25%    27.000000
50%    30.000000
75%    39.500000
max    49.000000
import pandas as pd
df = pd.read_csv("country_data.csv")
Note: import pandas as pd and import pandas load the same module; the first
form simply binds it to the shorter name pd.
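This can be checked in a couple of lines. The sketch below imports the library both ways and confirms that the two names point to the same module object.

```python
import pandas
import pandas as pd

# Both names refer to one and the same module object
print(pd is pandas)  # True

# So any function reached through either name is the same function
same_function = pd.read_csv is pandas.read_csv
print(same_function)  # True
```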
Column-Based Access:
Descriptive Statistics:
# Basic statistics (the column is named 'Age' in the info() output above)
print(df['Age'].min())
print(df['Age'].max())
print(df['Age'].mean())
# Filtering data
print(df[df['Age'] > 30])
# Removing a column
df.drop(columns=['new_column'], inplace=True)
print(df)
Well done if you have gotten this far. Remember to try the above examples
out yourself!