Pandas
1. Installation
You can install Pandas using pip:
```bash
pip install pandas
```
2. Importing Pandas
Pandas is typically imported with the alias pd:
```python
import pandas as pd
```
3.1 Series
A Series is a one-dimensional labeled array capable of holding any data type.
```python
s = pd.Series([1, 3, 5, 7, 9])
print(s)
```
With a custom index:
```python
s = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
print(s)
```
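As a small illustration of what the index labels are for, the sketch below looks values up by label and by position. It reuses the Series from the example above; the variable names (by_label, by_position, subset, doubled) are made up for this illustration.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])

by_label = s['b']         # label-based lookup
by_position = s.iloc[1]   # position-based lookup (second element)
subset = s[['a', 'd']]    # a list of labels returns a smaller Series
doubled = s * 2           # arithmetic is element-wise and keeps the index

print(by_label, by_position)
print(subset)
print(doubled)
```

Using .iloc for positions keeps the two kinds of lookup visibly separate, which helps when the labels themselves are integers.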
3.2 DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous
tabular data structure with labeled axes (rows and columns).
You can create a DataFrame from a dictionary:
```python
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)
```
4. DataFrame Operations
First rows:
```python
df.head()  # By default, shows the first 5 rows
```
Last rows:
```python
df.tail(3)  # Shows the last 3 rows
```
Shape (rows, columns):
```python
df.shape
```
Column names:
```python
df.columns
```
Data types:
```python
df.dtypes
```
Select a single column:
```python
df['Name']
```
Select multiple columns:
```python
df[['Name', 'Age']]
```
Select rows by label using .loc[]:
```python
df.loc[0]  # Select the row with index label 0
df.loc[0:2, ['Name', 'City']]  # Select rows 0 to 2 (end label included) and columns 'Name' and 'City'
```
Select rows by position using .iloc[]:
```python
df.iloc[0]  # First row
df.iloc[0:2, 0:2]  # First two rows and first two columns
```
Filter rows by condition:
```python
df[df['Age'] > 30]
```
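Conditions can also be combined. The following is a minimal sketch using the same sample data as the DataFrame example; the result names (over_30_not_berlin, in_paris_or_london, names_over_25) are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
over_30_not_berlin = df[(df['Age'] > 30) & (df['City'] != 'Berlin')]

# isin() tests membership in a list of values
in_paris_or_london = df[df['City'].isin(['Paris', 'London'])]

# .loc accepts a boolean mask for the rows and a column list at the same time
names_over_25 = df.loc[df['Age'] > 25, ['Name', 'Age']]

print(over_30_not_berlin)
print(in_paris_or_london)
print(names_over_25)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.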
5. Data Cleaning
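Check for missing values:
```python
df.isnull().sum()  # Number of null values in each column
```
Drop missing values:
```python
df.dropna()  # Returns a copy without the rows that contain missing values
```
Fill missing values:
```python
df.fillna(value=0)  # Returns a copy with NaN replaced by 0
```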
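The sample DataFrame used so far has no gaps, so these calls have nothing to act on. The sketch below builds a small frame with missing entries to show their effect; the data and the name df_missing are hypothetical, made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, for illustration only
df_missing = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                           'Age': [28, np.nan, 35],
                           'City': ['New York', 'Paris', None]})

null_counts = df_missing.isnull().sum()  # missing entries per column
dropped = df_missing.dropna()            # keeps only rows with no missing values

# A dict gives each column its own fill value
filled = df_missing.fillna({'Age': df_missing['Age'].mean(), 'City': 'Unknown'})

print(null_counts)
print(dropped)
print(filled)
```

Both dropna() and fillna() return new objects here, so df_missing itself still contains the gaps afterwards.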
Rename columns:
```python
df.rename(columns={'Name': 'Full Name'})  # Returns a renamed copy; pass inplace=True to change df itself
```
Change a column's data type:
```python
df['Age'] = df['Age'].astype(float)  # Convert Age to float
```
6. Data Transformation
Add a new column:
```python
df['Country'] = ['USA', 'France', 'Germany', 'UK']
```
Drop a column:
```python
df.drop(columns=['City'])  # Returns a copy without the 'City' column; df itself is unchanged
```
Sort by a column:
```python
df.sort_values(by='Age', ascending=False)  # Sort by 'Age' in descending order
```
Apply a function to a column:
```python
df['Age'] = df['Age'].apply(lambda x: x + 1)
```
Group by one column:
```python
grouped = df.groupby('Country')['Age'].mean()
print(grouped)
```
Group by multiple columns:
```python
grouped = df.groupby(['Country', 'City'])['Age'].mean()
print(grouped)
```
Multiple aggregations:
```python
df.groupby('Country').agg({'Age': ['mean', 'sum'], 'Name': 'count'})
```
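The dictionary form above produces two-level column labels. One alternative, sketched here, is named aggregation, which gives each result column a flat name. The extra row ('Marie') and the output column names (mean_age, total_age, people) are assumptions added so that one group has more than one member.

```python
import pandas as pd

# Sample data; the 'Marie' row is hypothetical, added so France has two members
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda', 'Marie'],
                   'Age': [28, 24, 35, 32, 30],
                   'Country': ['USA', 'France', 'Germany', 'UK', 'France']})

# Each keyword is an output column: (input column, aggregation function)
summary = df.groupby('Country').agg(
    mean_age=('Age', 'mean'),
    total_age=('Age', 'sum'),
    people=('Name', 'count'),
)

print(summary)
```

Flat column names are usually easier to reference later than the (column, function) tuples of the dictionary form.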
Row-wise concatenation:
```python
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
pd.concat([df1, df2], axis=0)
```
Column-wise concatenation:
```python
pd.concat([df1, df2], axis=1)
```
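Concatenation keeps the original labels, which can leave duplicate index values (row-wise) or duplicate column names (column-wise). The sketch below shows one way to handle each case with the same two frames; the result names (stacked, renumbered, side_by_side) and the '_2' suffix are illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Row-wise: the original index values are kept, so 0 and 1 appear twice
stacked = pd.concat([df1, df2], axis=0)

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

# Column-wise: suffix the second frame's columns to avoid duplicate names
side_by_side = pd.concat([df1, df2.add_suffix('_2')], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```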
Inner join:
```python
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='key', how='inner')
```
Left join:
```python
pd.merge(df1, df2, on='key', how='left')
```
Outer join:
```python
pd.merge(df1, df2, on='key', how='outer')
```
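To compare the three join types side by side, the sketch below runs each on the same two frames. It also passes indicator=True to the outer join, an optional argument not used above, which adds a _merge column recording where each row came from. The result names (inner, left, outer, origin) are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

inner = pd.merge(df1, df2, on='key', how='inner')  # only keys present in both frames
left = pd.merge(df1, df2, on='key', how='left')    # every key from df1; value2 is NaN where df2 has no match
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)  # every key from either frame

# Map each key to its origin: 'left_only', 'right_only', or 'both'
origin = outer.set_index('key')['_merge'].astype(str).to_dict()

print(inner)
print(left)
print(outer)
```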
Pandas can read data from a variety of file formats, with CSV being the most
common.
Read a CSV file:
```python
df = pd.read_csv('data.csv')
```
Write a CSV file:
```python
df.to_csv('output.csv', index=False)  # index=False leaves the row index out of the file
```
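The calls above need real files on disk. As a self-contained sketch of the same round trip, the example below writes to and reads from an in-memory text buffer instead; the use of io.StringIO and the names buffer, csv_text and restored are assumptions made for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# Write the CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back; read_csv accepts any file-like object
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Without index=False, the row index would be written as an extra unnamed column and come back as a column named 'Unnamed: 0' on reading.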
Read an Excel file:
```python
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # Reading .xlsx files requires the openpyxl package
```
Write an Excel file:
```python
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
```
Create a date range:
```python
dates = pd.date_range('2024-01-01', periods=6, freq='D')
df = pd.DataFrame({'Date': dates, 'Value': [1, 2, 3, 4, 5, 6]})
```
Set the date column as the index:
```python
df.set_index('Date', inplace=True)
```
10.3 Resampling
You can resample time series data (e.g., daily to monthly).
```python
# Resample to monthly data, taking the mean of each month.
# On pandas 2.2 and later the month-end alias is 'ME'; 'M' is deprecated there.
df.resample('M').mean()
```
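As a worked illustration of resampling, the sketch below uses a four-day series that crosses a month boundary, so the monthly result has two rows. It uses the 'MS' (month start) alias, chosen here because it is spelled the same way across pandas versions; the data and the names ts, monthly_mean and two_day_sum are hypothetical.

```python
import pandas as pd

# Hypothetical series covering Jan 30, Jan 31, Feb 1 and Feb 2
dates = pd.date_range('2024-01-30', periods=4, freq='D')
ts = pd.DataFrame({'Value': [1, 2, 3, 4]}, index=dates)

# One row per month, labeled with the first day of the month
monthly_mean = ts.resample('MS').mean()

# Resampling also works for other frequencies, for example two-day bins
two_day_sum = ts.resample('2D').sum()

print(monthly_mean)
print(two_day_sum)
```

Resampling requires a datetime-like index, which is why the dates are passed as the index rather than as an ordinary column.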
Filter by date range:
```python
df.loc['2024-01-01':'2024-01-03']  # Rows within this date range (both end dates included)
```
Create a pivot table:
```python
pivot = df.pivot_table(values='Age', index='Country', columns='City', aggfunc='mean')
print(pivot)
```
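The call above assumes a DataFrame with 'Age', 'Country' and 'City' columns. The sketch below builds such a frame so the result can be inspected directly; the rows and the name people are hypothetical, made up for this illustration.

```python
import pandas as pd

# Hypothetical rows; two of them share the same Country and City
people = pd.DataFrame({'Country': ['USA', 'USA', 'France', 'France'],
                       'City': ['New York', 'Boston', 'Paris', 'Paris'],
                       'Age': [28, 40, 24, 30]})

# Rows are countries, columns are cities, cells hold the mean age
pivot = people.pivot_table(values='Age', index='Country', columns='City', aggfunc='mean')

print(pivot)
```

Combinations that never occur in the data, such as USA and Paris, show up as NaN in the result.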
Line plot:
```python
df.plot(x='Date', y='Value')
```
Bar plot:
```python
df.plot(kind='bar', x='Name', y='Age')
```
12.3 Histogram
```python
df['Age'].plot(kind='hist')
```
Scatter plot:
```python
df.plot(kind='scatter', x='Age', y='Value')
```
13. Advanced Topics
13.1 MultiIndex
You can work with multiple levels of indexing (hierarchical indexing) in Pandas.
```python
arrays = [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=index)
```
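Select data from a MultiIndex:
```python
df.loc[('bar', 'one')]  # Row where first == 'bar' and second == 'one'; the tuple avoids ambiguity with column labels
```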
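A few other common ways to select from a hierarchical index are sketched below, using the same frame as above. The result names (outer, single, cross, flat) are illustrative only.

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=index)

outer = df.loc['bar']                 # all rows under first == 'bar'; the 'first' level is dropped
single = df.loc[('baz', 'two'), 'A']  # one cell: a full row key plus a column label
cross = df.xs('one', level='second')  # all rows where second == 'one'
flat = df.reset_index()               # turn the index levels back into ordinary columns

print(outer)
print(single)
print(cross)
print(flat)
```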
Read a large file in chunks:
```python
chunk_size = 1000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    process(chunk)  # process() is a placeholder for your own function; each chunk is a DataFrame
```
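As a runnable sketch of the same pattern, the example below reads a ten-row CSV in chunks of four and keeps a running total, so only one chunk is in memory at a time. An in-memory buffer stands in for the large file; the column name value and the variables csv_source, total and rows_seen are assumptions for illustration.

```python
import io

import pandas as pd

# Stand-in for a large file: a single 'value' column holding 0..9
csv_source = io.StringIO('value\n' + '\n'.join(str(i) for i in range(10)))

total = 0
rows_seen = 0
for chunk in pd.read_csv(csv_source, chunksize=4):
    # Each chunk is an ordinary DataFrame with at most 4 rows
    total += chunk['value'].sum()
    rows_seen += len(chunk)

print(total, rows_seen)
```

This works for any aggregate that can be built up piece by piece, such as sums and counts; statistics like the median need a different approach.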
Categorical data is a type of data with a fixed number of possible values
(categories).
```python
df['Gender'] = df['Gender'].astype('category')
```
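The line above assumes a DataFrame that already has a 'Gender' column. The sketch below builds one to show what the conversion produces: a short list of categories plus an integer code per row. The sample values and the names categories, codes and counts are hypothetical.

```python
import pandas as pd

# Hypothetical column with two distinct values
df = pd.DataFrame({'Gender': ['F', 'M', 'F', 'F', 'M']})
df['Gender'] = df['Gender'].astype('category')

categories = df['Gender'].cat.categories.tolist()  # the distinct values, sorted
codes = df['Gender'].cat.codes.tolist()            # integer code of each row's category
counts = df['Gender'].value_counts()               # rows per category

print(df['Gender'].dtype)
print(categories, codes)
print(counts)
```

Storing small integer codes instead of repeated strings is what can make categorical columns lighter on memory when a column has few distinct values.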