Basic Data Processing with Pandas
Pandas Series
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
# Positional indexing
print(s.iloc[0]) # Output: 4
# Label-based indexing
print(s['b']) # Output: 7
# Boolean indexing
print(s[s > 0]) # Output: a 4
# b 7
# d 3
Vectorised Operations
A Pandas Series supports vectorised operations: an operation is applied to each
element of the Series without the need for an explicit loop.
s = pd.Series([1, 2, 3, 4])
# Arithmetic operations
print(s + 5) # Output: Series with each element incremented by 5
# Element-wise operations
print(s * 2) # Output: Series with each element multiplied by 2
Alignment of Data
When performing operations between two Series, Pandas automatically aligns the data based
on the index. If an index label appears in only one of the Series, the result will have NaN
for that label.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding Series
result = s1 + s2
print(result)
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
Pandas Series has built-in methods to handle missing data (NaN values).
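A minimal sketch of those missing-data methods, reusing the aligned result from the addition above:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
result = s1 + s2  # 'a' and 'd' become NaN after alignment

# Detect missing values
print(result.isna())     # True for 'a' and 'd'

# Replace NaN with a default value
print(result.fillna(0))  # a 0.0, b 6.0, c 8.0, d 0.0

# Drop missing entries entirely
print(result.dropna())   # b 6.0, c 8.0
```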
Series Methods
Series objects come with a variety of methods for data manipulation and analysis:
s = pd.Series([5, 3, 8, 2])
print(s.sum()) # Output: 18
print(s.mean()) # Output: 4.5
print(s.sort_values()) # Output: Sorted Series
3 2
1 3
0 5
2 8
print(s.rank()) # Output: Series with ranks
0 3.0
1 2.0
2 4.0
3 1.0
print(s.apply(lambda x: x**2)) # Output: Series with squared values
0 25
1 9
2 64
3 4
Pandas Series is well-suited for handling time series data. You can use date ranges as the
index to create a time series.
pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
'2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
dtype='datetime64[ns]', freq='D')
For example, a Series indexed by dates:
s = pd.Series([1, 3, 5, 7, 9, 11], index=pd.date_range('2023-01-01', periods=6))
print(s)
2023-01-01 1
2023-01-02 3
2023-01-03 5
2023-01-04 7
2023-01-05 9
2023-01-06 11
Querying a Series
Querying a Series in Pandas involves selecting and filtering data based on various conditions.
Accessing Elements by Index: You can access elements in a Series using the index, either
by position (integer-based) or by label.
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Access the first element by position
print(s.iloc[0]) # Output: 10
# Access the third element by position
print(s.iloc[2]) # Output: 30
# Slice the first three elements
print(s.iloc[:3]) # Output: a 10
b 20
c 30
Boolean Indexing: Boolean indexing allows you to filter a Series based on a condition.
# Query elements greater than 20
print(s[s > 20]) # Output: c 30
#                          d 40
You can combine multiple conditions using logical operators like & (and), | (or), and ~ (not).
# Elements greater than 20 AND less than 40
print(s[(s > 20) & (s < 40)]) # Output: c 30
Using isin() Method: The isin() method is used to filter data based on whether the elements
are in a list of values.
print(s[s.isin([30])]) # Output: c 30
Slicing with .loc and .iloc: .loc slices by label (the end label is included); .iloc slices
by position (the end position is excluded).
print(s.loc['b':'d']) # Output: b 20
c 30
d 40
print(s.iloc[1:3]) # Output: b 20
c 30
Querying with Conditional Functions: You can also use functions like .where() and
.query() for conditional selection.
.where(): keeps values that satisfy the condition and replaces the rest with NaN.
print(s.where(s > 20)) # Output: a NaN
#                                b NaN
#                                c 30.0
#                                d 40.0
.query(): Though primarily used for DataFrames, it can be used with a Series by
temporarily converting it to a DataFrame.
df = s.to_frame(name='value')
result = df.query('value > 20')
print(result) # Output:   value
#                      c     30
#                      d     40
Handling Missing Data: While querying, you might encounter missing data (NaN), which you
can remove with .dropna().
print(s.where(s > 20).dropna()) # Output: c 30.0
#                                         d 40.0
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a
table with rows and columns.
import pandas as pd
# create a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
# create a dataframe from the dictionary
df = pd.DataFrame(data)
print(df) # Output: Name Age City
0 John 25 New York
1 Alice 30 London
2 Bob 35 Paris
A DataFrame can also be loaded from a CSV file:
df = pd.read_csv('data.csv')
print(df)
Calling the constructor with no arguments creates an empty DataFrame:
df = pd.DataFrame()
print(df) # Output: Empty DataFrame
# Columns: []
# Index: []
DataFrame.head(n): returns the first n rows (default n=5).
For negative values of n, this function returns all rows except the last |n| rows.
If n is larger than the number of rows, this function returns all rows.
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
DataFrame.tail(n): returns the last n rows (default n=5).
For negative values of n, this function returns all rows except the first |n| rows.
If n is larger than the number of rows, this function returns all rows.
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
>>> df.tail(-3)
animal
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
Column addition: In order to add a column to a Pandas DataFrame, we can declare a new
list as a column and assign it to the existing DataFrame.
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
df = pd.DataFrame(data)
# Declare a new list (illustrative values) and assign it as a new column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
df['Address'] = address
Selecting Rows
By label: df.loc[label]: Label-based data selector. The end index is included during
slicing.
By index: df.iloc[index]: Index-based data selector. The end index is excluded during
slicing.
Merging DataFrames: df.merge() joins two DataFrames. The how parameter controls the
join type:
left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from right frame, similar to a SQL right outer
join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves
the order of the left keys.
on: label or list: Column or index level names to join on. These must be found in
both DataFrames. If on is None and not merging on indexes, this defaults to the
intersection of the columns in both DataFrames.
left_on: Column(s) from the left DataFrame to use as keys.
right_on: Column(s) from the right DataFrame to use as keys.
suffixes: Suffix to apply to overlapping column names in the left and right side.
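A minimal sketch of these join types (the key and value columns are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})

# Inner join: intersection of keys -> only 'b' and 'c' survive
print(left.merge(right, on='key', how='inner'))

# Outer join: union of keys; unmatched values become NaN
print(left.merge(right, on='key', how='outer'))
```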
Pivot Table
Pivot tables in Pandas are powerful tools for summarizing data, allowing you to aggregate and
reshape your data easily. The pivot_table() function in Pandas is similar to Excel's pivot tables.
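A minimal pivot_table() sketch, with an invented sales table for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Product': ['A', 'B', 'A', 'B'],
    'Revenue': [100, 150, 200, 250]
})

# Aggregate Revenue with Region as rows and Product as columns
table = sales.pivot_table(values='Revenue', index='Region',
                          columns='Product', aggfunc='sum')
print(table)
```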
func: Function to use for aggregating the data, or a list of functions and/or
function names, e.g. ['sum', 'mean'].
DataFrame.agg: Aggregate using one or more operations over the specified axis.
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
>>> df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
3 NaN
DataFrame.groupby: The groupby() method is used to group data based on one or more
columns, and then you can apply aggregation functions such as mean(), sum(), count(), etc.
Syntax:
df.groupby('column_name').aggregate_function()
Example:
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Employee': ['John', 'Anna', 'Mike', 'Sara', 'Paul', 'Kate'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000]
}
df = pd.DataFrame(data)
# Group by Department and calculate the mean salary for each department
grouped_df = df.groupby('Department')['Salary'].mean()
print(grouped_df)
output:
Department
Finance 70000.0
HR 55000.0
IT 75000.0
Name: Salary, dtype: float64
Multiple Aggregations: Pass a list of functions to .agg() to compute several statistics
at once.
grouped_df = df.groupby('Department')['Salary'].agg(['mean', 'sum'])
print(grouped_df)
output:
              mean     sum
Department
Finance    70000.0  140000
HR         55000.0  110000
IT         75000.0  150000
You can also group by multiple columns:
grouped_df = df.groupby(['Department', 'Employee'])['Salary'].sum()
print(grouped_df)
output:
Department  Employee
Finance     Kate        65000
            Paul        75000
HR          Anna        60000
            John        50000
IT          Mike        70000
            Sara        80000
Name: Salary, dtype: int64
The .describe() method provides a summary of statistics like count, mean, standard
deviation, min, and max for numeric columns.
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000],
'Years_Experience': [5, 7, 3, 10, 8, 6]
}
df = pd.DataFrame(data)
# Summary statistics
summary_table = df.describe()
print(summary_table)
output:
             Salary  Years_Experience
count      6.000000          6.000000
mean   66666.666667          6.500000
std    10801.234497          2.428992
min    50000.000000          3.000000
25%    61250.000000          5.250000
50%    67500.000000          6.500000
75%    73750.000000          7.750000
max    80000.000000         10.000000
You can use the groupby() function combined with aggregation methods to generate
summary tables based on categorical columns.
grouped_summary = df.groupby('Department').agg(['mean', 'sum', 'min', 'max'])
print(grouped_summary)
output:
Salary Years_Experience
mean sum min max mean sum min max
Department
Finance 70000.0 140000 65000 75000 7.0 14 6 8
HR 55000.0 110000 50000 60000 6.0 12 5 7
IT 75000.0 150000 70000 80000 6.5 13 3 10
3. Crosstab Summary: pd.crosstab() computes a frequency table of two (or more) factors.
crosstab_summary = pd.crosstab(df['Department'], df['Years_Experience'])
print(crosstab_summary)
output:
Years_Experience 3 5 6 7 8 10
Department
Finance 0 0 1 0 1 0
HR 0 1 0 1 0 0
IT 1 0 0 0 0 1
In this case, it shows the number of employees in each department with various years of experience.
Pivot tables provide more flexibility by allowing you to compute aggregated values across
multiple dimensions.
pivot_summary = df.pivot_table(values='Salary', index='Department',
                               columns='Years_Experience', aggfunc='mean')
print(pivot_summary)
output:
Years_Experience       3        5        6        7        8        10
Department
Finance              NaN      NaN  65000.0      NaN  75000.0      NaN
HR                   NaN  50000.0      NaN  60000.0      NaN      NaN
IT               70000.0      NaN      NaN      NaN      NaN  80000.0
If you need a highly customized summary, you can use .apply() to create a summary table
with user-defined functions.
def salary_range(x):
    return x.max() - x.min()
custom_summary = df.groupby('Department').Salary.apply(salary_range)
print(custom_summary)
output:
Department
Finance    10000
HR         10000
IT         10000
Name: Salary, dtype: int64