Class 6 Pandas

Uploaded by

Shreyansu Sahoo
9/28/24, 3:08 PM Class 6 Pandas - Jupyter Notebook

Python Pandas

What is Pandas?
Pandas is a powerful data manipulation and analysis library for Python.
It provides data structures like Series and DataFrame for handling and analyzing data.

Why Pandas?
Pandas simplifies many of the time-consuming, repetitive tasks involved in working with tabular data, such as:

Import datasets - load data from spreadsheets, comma-separated values (CSV) files, and more.
Data cleansing - deal with missing values, which are represented as NaN, NA, or NaT.
Size mutability - columns can be added to and removed from DataFrames and higher-dimensional objects.
Data normalization - normalize the data into a suitable format for analysis.
Data alignment - objects can be explicitly aligned to a set of labels.
Intuitive merging and joining - datasets can be merged and joined.
Reshaping and pivoting - datasets can be reshaped and pivoted as needed.
Efficient manipulation and extraction - specific parts of large datasets can be manipulated and extracted using label-based slicing, indexing, and subsetting.
Statistical analysis - perform statistical operations on datasets.
Data visualization - visualize datasets and uncover insights.
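
The "data alignment" item in the list above can be illustrated with a small sketch. The two Series and their labels (x, y, z, w) are made up for this illustration; the point is only that arithmetic matches values by label rather than by position.

```python
import pandas as pd

# Two Series whose labels only partly overlap (hypothetical labels)
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Addition aligns on labels; labels present in only one Series give NaN
total = a + b
print(total)

# add() with fill_value treats a missing label as 0 instead of NaN
filled = a.add(b, fill_value=0)
print(filled)
```

Only the labels y and z exist in both Series, so only those two positions hold a real sum in `total`.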

Applications of Pandas
Data Cleaning: Pandas provides tools to clean messy data: handle incomplete, inconsistent, or missing values, remove duplicates, and standardize formats for effective analysis.
Data Exploration: Pandas summarizes statistics, helps find trends, and visualizes data through built-in plotting functions or through Matplotlib and Seaborn.
Data Preparation: Pandas can pivot, melt, convert variables, and merge datasets on common columns to prepare data for analysis.
Data Analysis: Pandas supports descriptive statistics, time series analysis, group-by operations, and custom functions.
Data Visualization: Pandas has basic plotting capabilities of its own and works with visualization libraries such as Matplotlib, Seaborn, and Plotly.
Time Series Analysis: Pandas supports date/time indexing, resampling, frequency conversion, and rolling statistics for time series data.
Data Aggregation and Grouping: Pandas' groupby() function lets you aggregate data, compute group-wise summary statistics, or apply functions to groups.
Data Input/Output: Pandas makes data import and export easy by reading and writing CSV, Excel, JSON, SQL databases, and more.
Machine Learning: Pandas works well with Scikit-learn for data preparation, feature engineering, and model input data.
Web Scraping: Pandas can be used alongside BeautifulSoup or Scrapy to parse and analyze structured data extracted from the web.
Financial Analysis: Pandas is commonly used in finance for stock market data analysis, financial indicator calculation, and portfolio optimization.
Text Data Analysis: Pandas' string methods and regular-expression support help analyze textual data.
Experimental Data Analysis: Pandas makes it easy to manipulate and analyze large datasets, perform statistical tests, and visualize results.
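
The time series item in the list above mentions resampling and rolling statistics, which the rest of these notes do not demonstrate. A minimal sketch follows; the dates and values are invented for the example.

```python
import pandas as pd

# Six daily observations (hypothetical values)
idx = pd.date_range('2024-06-01', periods=6, freq='D')
ts = pd.Series([10, 20, 30, 40, 50, 60], index=idx)

# Resampling: combine the daily values into 2-day bins and sum each bin
two_day = ts.resample('2D').sum()
print(two_day)

# Rolling statistic: mean over a sliding window of 3 observations
# (the first two entries are NaN because the window is not yet full)
rolling = ts.rolling(window=3).mean()
print(rolling)
```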

1. Installation
To install pandas, use:

!pip install pandas

2. Importing pandas

In [1]: import pandas as pd

3. Data Structures in pandas

Series

A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

In [2]: # Creating a Series
        s = pd.Series([1, 3, 5, 6, 8])
        print(s)

0    1
1    3
2    5
3    6
4    8
dtype: int64
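
The index does not have to be the default 0, 1, 2, ... numbering. The sketch below gives the same values explicit string labels (the labels 'a' to 'e' are made up for illustration) and reads one element both by label and by position.

```python
import pandas as pd

# Same data as above, but with explicit (hypothetical) string labels
s = pd.Series([1, 3, 5, 6, 8], index=['a', 'b', 'c', 'd', 'e'])

print(s.index.tolist())  # the labels
print(s.values)          # the underlying array of data
print(s['c'])            # access by label
print(s.iloc[2])         # access by position (same element)
```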

DataFrame

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In [3]: # Creating a DataFrame
        data = {
            'Name': ['John', 'Anna', 'Peter', 'Linda'],
            'Age': [28, 24, 35, 32]
        }
        df4 = pd.DataFrame(data)
        print(df4)

    Name  Age
0   John   28
1   Anna   24
2  Peter   35
3  Linda   32
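
To make "heterogeneous" and "labeled axes" concrete, the following sketch rebuilds the same frame and inspects it. It only uses the data shown above.

```python
import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32]
}
df4 = pd.DataFrame(data)

# Each column can have its own dtype (text for Name, integers for Age)
print(df4.dtypes)

# The two labeled axes: row labels (index) and column labels
print(df4.index.tolist())
print(df4.columns.tolist())

# Selecting one column returns a Series
age = df4['Age']
print(type(age))
```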

4. Reading Data
Pandas can read data from various file formats, including CSV, Excel, SQL databases, and
more.

In [4]: df = pd.read_csv('train.csv')
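
The file train.csv is not included with these notes. As a stand-in, the sketch below parses a few rows of the same shape from an in-memory string. The three rows and the reduced column set are taken from the output shown in this notebook, while `csv_text`, `sample`, and the choice of `index_col` are assumptions made for the example.

```python
import io
import pandas as pd

# A tiny CSV excerpt; quoted fields may contain commas, and "" is an escaped quote
csv_text = '''PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen Harris",male,22.0
5,0,3,"Allen, Mr. William Henry",male,35.0
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,
'''

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')
print(sample)

# The empty Age field of passenger 889 is read as NaN
print(sample['Age'].isna().sum())
```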

In [5]: df

Out[5]:
     PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket              Fare  ...
  0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.250  ...
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.283  ...
  2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.925  ...
  3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.100  ...
  4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.050  ...
...          ...       ...     ...  ...                                                ...      ...    ...    ...  ...                  ...  ...
886          887         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  211536            13.000  ...
887          888         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  112053            30.000  ...
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"           female   NaN      1      2  W./C. 6607        23.450  ...
889          890         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  111369            30.000  ...
890          891         0       3  Dooley, Mr. Patrick                                male    32.0      0      0  370376             7.750  ...

891 rows × 12 columns

In [6]: df1 = pd.read_excel('Class 1.xlsx', sheet_name='Sheet1')

In [7]: df1

Out[7]:
         Date   Category        Item  Amount  Major Expenses
0  2024-06-01       Food    sandwich      10              NO
1  2024-06-01       Food      samosa      20              NO
2  2024-06-01       Food   Groceries      50              NO
3  2024-06-01  Utilities  Phone Bill     100             YES
4  2024-06-01  Utilities    Gas bill     200             YES
5  2024-06-01       Rent   Home Rent     500             YES
6  2024-06-01  Utilities  Water Bill      20              NO
7  2024-06-02       Food    sandwich     100             YES
8  2024-06-02       Food      samosa     200             YES

5. Data Inspection
You can inspect your data using various methods:

In [8]: # Display the first five rows
        df.head()

Out[8]:
     PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  ...
  0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  ...
  1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  ...
  2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  ...
  3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  ...
  4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  ...

In [9]: # Display the last five rows
        df.tail()

Out[9]:
     PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket             Fare  ...
886          887         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  211536            13.00  ...
887          888         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  112053            30.00  ...
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"           female   NaN      1      2  W./C. 6607        23.45  ...
889          890         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  111369            30.00  ...
890          891         0       3  Dooley, Mr. Patrick                                male    32.0      0      0  370376             7.75  ...

In [10]: # Display the DataFrame's shape (rows, columns)
         df.shape

Out[10]: (891, 12)

In [11]: # Get a concise summary of the DataFrame
         df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [12]: # Display basic statistics
         df.describe()

Out[12]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
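
describe() covers numeric columns by default. For a text column, a frequency count is often the more useful summary. The sketch below rebuilds part of the expense table shown earlier (the name `expenses` is an assumption, and the Date and Major Expenses columns are left out) and inspects one column of each kind.

```python
import pandas as pd

# Category, Item, and Amount values copied from the expense table above
expenses = pd.DataFrame({
    'Category': ['Food', 'Food', 'Food', 'Utilities', 'Utilities',
                 'Rent', 'Utilities', 'Food', 'Food'],
    'Item': ['sandwich', 'samosa', 'Groceries', 'Phone Bill', 'Gas bill',
             'Home Rent', 'Water Bill', 'sandwich', 'samosa'],
    'Amount': [10, 20, 50, 100, 200, 500, 20, 100, 200],
})

# Summary statistics for one numeric column
stats = expenses['Amount'].describe()
print(stats)

# How often each value occurs in a text column
counts = expenses['Category'].value_counts()
print(counts)
```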

6. Data Selection and Filtering

Selecting columns

In [13]: # Select a single column
         df1['Category']

Out[13]:
0         Food
1         Food
2         Food
3    Utilities
4    Utilities
5         Rent
6    Utilities
7         Food
8         Food
Name: Category, dtype: object

In [14]: # Select multiple columns
         df1[['Category', 'Item']]

Out[14]:
    Category        Item
0       Food    sandwich
1       Food      samosa
2       Food   Groceries
3  Utilities  Phone Bill
4  Utilities    Gas bill
5       Rent   Home Rent
6  Utilities  Water Bill
7       Food    sandwich
8       Food      samosa

Selecting rows

In [15]: # Select rows by integer position
         print(df1.iloc[0])  # First row

Date              2024-06-01 00:00:00
Category                         Food
Item                         sandwich
Amount                             10
Major Expenses                     NO
Name: 0, dtype: object

In [16]: print(df1.iloc[1:4])  # Positions 1 to 3: second to fourth row (end position excluded)

         Date   Category        Item  Amount  Major Expenses
1  2024-06-01       Food      samosa      20              NO
2  2024-06-01       Food   Groceries      50              NO
3  2024-06-01  Utilities  Phone Bill     100             YES

In [17]: # Select rows by label
         print(df1.loc[0])  # First row

Date              2024-06-01 00:00:00
Category                         Food
Item                         sandwich
Amount                             10
Major Expenses                     NO
Name: 0, dtype: object

In [18]: print(df1.loc[1:4])  # Labels 1 to 4: second to fifth row (end label included)

         Date   Category        Item  Amount  Major Expenses
1  2024-06-01       Food      samosa      20              NO
2  2024-06-01       Food   Groceries      50              NO
3  2024-06-01  Utilities  Phone Bill     100             YES
4  2024-06-01  Utilities    Gas bill     200             YES
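
With the default 0, 1, 2, ... index, positions and labels coincide, which hides the difference between iloc and loc. The sketch below gives the rows non-default labels (10, 20, 30, 40 are made up for the example) so that the two selectors visibly diverge.

```python
import pandas as pd

# Same people as earlier, but with hypothetical row labels 10..40
people = pd.DataFrame(
    {'Name': ['John', 'Anna', 'Peter', 'Linda'], 'Age': [28, 24, 35, 32]},
    index=[10, 20, 30, 40],
)

# iloc slices by position and excludes the end position
by_position = people.iloc[1:3]
print(by_position)

# loc slices by label and includes the end label
by_label = people.loc[20:40]
print(by_label)
```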

Filtering data

In [19]: # Filter rows based on a condition
         filtered_df = df1[df1['Amount'] < 51]
         print(filtered_df)

         Date   Category        Item  Amount  Major Expenses
0  2024-06-01       Food    sandwich      10              NO
1  2024-06-01       Food      samosa      20              NO
2  2024-06-01       Food   Groceries      50              NO
6  2024-06-01  Utilities  Water Bill      20              NO
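
Conditions can also be combined. The sketch below rebuilds part of the expense table (the name `expenses` is an assumption) and shows three common patterns: `&` for "and", `isin()` for membership, and `query()` as a string-based alternative. Each condition needs its own parentheses when `&` or `|` is used.

```python
import pandas as pd

# Category, Item, and Amount values copied from the expense table above
expenses = pd.DataFrame({
    'Category': ['Food', 'Food', 'Food', 'Utilities', 'Utilities',
                 'Rent', 'Utilities', 'Food', 'Food'],
    'Item': ['sandwich', 'samosa', 'Groceries', 'Phone Bill', 'Gas bill',
             'Home Rent', 'Water Bill', 'sandwich', 'samosa'],
    'Amount': [10, 20, 50, 100, 200, 500, 20, 100, 200],
})

# Both conditions must hold: cheap items in the Food category
cheap_food = expenses[(expenses['Amount'] < 51) & (expenses['Category'] == 'Food')]

# Membership test: rows whose Category is one of several values
fixed_costs = expenses[expenses['Category'].isin(['Rent', 'Utilities'])]

# The same kind of filter written as a query string
big_food = expenses.query("Amount >= 100 and Category == 'Food'")

print(cheap_food)
print(fixed_costs)
print(big_food)
```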

7. Data Manipulation

Adding new columns

In [20]: df1['NewAmount'] = df1['Amount'] + 10
         print(df1)

         Date   Category        Item  Amount  Major Expenses  NewAmount
0  2024-06-01       Food    sandwich      10              NO         20
1  2024-06-01       Food      samosa      20              NO         30
2  2024-06-01       Food   Groceries      50              NO         60
3  2024-06-01  Utilities  Phone Bill     100             YES        110
4  2024-06-01  Utilities    Gas bill     200             YES        210
5  2024-06-01       Rent   Home Rent     500             YES        510
6  2024-06-01  Utilities  Water Bill      20              NO         30
7  2024-06-02       Food    sandwich     100             YES        110
8  2024-06-02       Food      samosa     200             YES        210

Dropping columns

In [21]: df1 = df1.drop(columns=['NewAmount'])
         print(df1)

         Date   Category        Item  Amount  Major Expenses
0  2024-06-01       Food    sandwich      10              NO
1  2024-06-01       Food      samosa      20              NO
2  2024-06-01       Food   Groceries      50              NO
3  2024-06-01  Utilities  Phone Bill     100             YES
4  2024-06-01  Utilities    Gas bill     200             YES
5  2024-06-01       Rent   Home Rent     500             YES
6  2024-06-01  Utilities  Water Bill      20              NO
7  2024-06-02       Food    sandwich     100             YES
8  2024-06-02       Food      samosa     200             YES

Renaming columns

In [22]: df1 = df1.rename(columns={'Amount': 'My_Amount'})
         print(df1)

         Date   Category        Item  My_Amount  Major Expenses
0  2024-06-01       Food    sandwich         10              NO
1  2024-06-01       Food      samosa         20              NO
2  2024-06-01       Food   Groceries         50              NO
3  2024-06-01  Utilities  Phone Bill        100             YES
4  2024-06-01  Utilities    Gas bill        200             YES
5  2024-06-01       Rent   Home Rent        500             YES
6  2024-06-01  Utilities  Water Bill         20              NO
7  2024-06-02       Food    sandwich        100             YES
8  2024-06-02       Food      samosa        200             YES
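
The three steps above reassign or modify df1 in place. An alternative style, sketched below on a three-row excerpt (the names `expenses` and `result` are assumptions), chains assign(), rename(), and drop(), each of which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# First three rows of the expense table, reduced to two columns
expenses = pd.DataFrame({
    'Item': ['sandwich', 'samosa', 'Groceries'],
    'Amount': [10, 20, 50],
})

result = (
    expenses
    .assign(NewAmount=lambda d: d['Amount'] + 10)  # add a derived column
    .rename(columns={'Amount': 'My_Amount'})       # rename a column
    .drop(columns=['Item'])                        # drop a column
)

print(result)
print(expenses.columns.tolist())  # the original frame is unchanged
```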

8. Handling Missing Data

Checking for missing values

In [23]: print(df.isnull().sum())

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

In [24]: print(df.notnull().sum())

PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64

Dropping missing values

In [25]: df2 = df.dropna()

In [26]: print(df2.isnull().sum())

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64

In [27]: print(df2.notnull().sum())

PassengerId 183
Survived 183
Pclass 183
Name 183
Sex 183
Age 183
SibSp 183
Parch 183
Ticket 183
Fare 183
Cabin 183
Embarked 183
dtype: int64

Filling missing values

In [28]: df3 = df.fillna(0)

In [29]: print(df3.isnull().sum())

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64

In [30]: print(df3.notnull().sum())

PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 891
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 891
Embarked 891
dtype: int64
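
Filling every gap with 0 also writes 0 into text columns such as Cabin and Embarked, and it lowers the average of numeric columns such as Age. A common alternative is to choose a fill value per column. The sketch below uses a small made-up frame (`passengers` and its values are hypothetical, not the Titanic data itself) and fills Age with the median, Embarked with the most frequent value, and Cabin with a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with gaps in three columns
passengers = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Cabin': [None, 'C85', None, 'C123'],
})

# A dict passed to fillna() maps each column to its own fill value
filled = passengers.fillna({
    'Age': passengers['Age'].median(),            # numeric: median
    'Embarked': passengers['Embarked'].mode()[0], # text: most frequent value
    'Cabin': 'Unknown',                           # text: explicit placeholder
})

print(filled)
print(filled.isnull().sum())
```

Which fill value is appropriate depends on the analysis; the choices here are only one reasonable option.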

9. Grouping and Aggregating Data

In [31]: # Group by a column and calculate the sum
         # (Name is text and every age occurs once, so each group returns its single name)
         grouped_df = df4.groupby('Age').sum()
         grouped_df

Out[31]:
      Name
Age
24    Anna
28    John
32   Linda
35   Peter

In [32]: # Group by two columns and calculate the mean
         # (no other columns remain, so the result shows only the group keys)
         grouped_df = df4.groupby(['Age', 'Name']).mean()
         grouped_df

Out[32]:
Age  Name
24   Anna
28   John
32   Linda
35   Peter
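
Grouping is easier to see on data with repeated keys and a numeric column to aggregate. The sketch below uses the Category and Amount values from the expense table shown earlier (the name `expenses` is an assumption) and computes per-category totals, then several statistics at once with agg().

```python
import pandas as pd

# Category and Amount values copied from the expense table above
expenses = pd.DataFrame({
    'Category': ['Food', 'Food', 'Food', 'Utilities', 'Utilities',
                 'Rent', 'Utilities', 'Food', 'Food'],
    'Amount': [10, 20, 50, 100, 200, 500, 20, 100, 200],
})

# Total amount per category
totals = expenses.groupby('Category')['Amount'].sum()
print(totals)

# Several aggregations in one call
summary = expenses.groupby('Category')['Amount'].agg(['sum', 'mean', 'count'])
print(summary)
```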

10. Merging and Joining DataFrames

Merging DataFrames

In [33]: df5 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                             'value': [1, 2, 3, 4]})
         df6 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                             'value': [5, 6, 7, 8]})

In [34]: merged_df = pd.merge(df5, df6, on='key')
         merged_df

Out[34]:
   key  value_x  value_y
0    B        2        5
1    D        4        6
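
pd.merge() performs an inner join by default, which is why only the keys B and D appear above. The sketch below reuses the same two frames and shows the `how` parameter; the `suffixes` and `indicator` arguments are optional extras added here for illustration.

```python
import pandas as pd

df5 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df6 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# Left join: keep every key from df5, fill gaps from df6 with NaN
left_merged = pd.merge(df5, df6, on='key', how='left',
                       suffixes=('_left', '_right'))
print(left_merged)

# Outer join: keep keys from both sides; indicator adds a _merge column
# that records where each row came from
outer_merged = pd.merge(df5, df6, on='key', how='outer', indicator=True)
print(outer_merged)
```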

Joining DataFrames

In [35]: left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                              'value': [1, 2, 3, 4]})
         right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                               'value': [5, 6, 7, 8]})

         joined_df = left.join(right.set_index('key'), on='key', lsuffix='_left', rs
         joined_df

Out[35]:
   key  value_left  value_right
0    A           1          NaN
1    B           2          5.0
2    C           3          NaN
3    D           4          6.0
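
The join call in the cell above is cut off at the page edge. The sketch below is a self-contained version that assumes the missing part is `rsuffix='_right'`; that ending is a guess, chosen because it is consistent with the value_right column in the output.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# join() matches the 'key' column of left against the index of the other
# frame, so right is indexed by 'key' first. It is a left join by default.
# rsuffix='_right' is an assumption (the original line is truncated).
joined_df = left.join(right.set_index('key'), on='key',
                      lsuffix='_left', rsuffix='_right')
print(joined_df)
```

Because this is a left join, keys A and C have no partner in right and receive NaN, which also turns value_right into a float column.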

11. Saving Data

# Saving to a CSV file
df.to_csv('filename.csv', index=False)

# Saving to an Excel file
df.to_excel('filename.xlsx', index=False)
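
To see what `index=False` changes without creating files, the sketch below writes the small Name/Age frame to an in-memory buffer and reads it back. The buffer and the variable names are assumptions made for the example.

```python
import io
import pandas as pd

df4 = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                    'Age': [28, 24, 35, 32]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df4.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Without index=False the row labels are written as an extra first column;
# to_csv() returns the text when no path or buffer is given
with_index = df4.to_csv()
print(with_index)

# Reading the text back reproduces the original frame
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df4))
```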
