DataFrame vs Series in Pandas

Last Updated : 16 Sep, 2024

Pandas is a widely-used Python library for data analysis that provides two essential data structures: Series and DataFrame. These structures are potent tools for handling and examining data, but they have different features and applications.

In this article, we will explore the differences between Series and DataFrames.

Table of Content

What is Pandas?
What is a Pandas Series?
Key Features of the Series data structure
What is a Pandas DataFrame?
Key Features of the DataFrame data structure
DataFrame vs Series

What is Pandas?

Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use data structures, DataFrame and Series, which are designed to make working with structured data fast, easy, and expressive. Pandas is widely used in data science, machine learning, and data analysis for tasks such as data cleaning, transformation, and exploration.

What is a Pandas Series?

A Pandas Series is a one-dimensional, array-like object that can hold data of any type (integer, float, string, etc.). It is labelled: each element is associated with a label, and the labels together form the index (labels are usually unique, although Pandas does not require it). You can think of a Series as a single column in a spreadsheet or a database table. Series are a fundamental data structure in Pandas and can be created from lists, arrays, dictionaries, and existing Series objects. They are also the building block of the Pandas DataFrame, a two-dimensional, table-like structure whose columns are Series objects.

Creating a Series from a list, from a dictionary, and with a custom index:

Python

import pandas as pd

# Initializing a Series from a list
data = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data)
print(series_from_list)

# Initializing a Series from a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
series_from_dict = pd.Series(data)
print(series_from_dict)

# Initializing a Series with custom index
data = [1, 2, 3, 4, 5]
index = ['a', 'b', 'c', 'd', 'e']
series_custom_index = pd.Series(data, index=index)
print(series_custom_index)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64
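
The text also mentions arrays and existing Series as sources. A minimal sketch of those two cases follows; the variable names and values are illustrative only, and `copy=True` is used here so that the new Series does not share data with the original.

```python
import numpy as np
import pandas as pd

# Initializing a Series from a NumPy array, with custom labels
array_data = np.array([10.5, 20.5, 30.5])
series_from_array = pd.Series(array_data, index=['x', 'y', 'z'])
print(series_from_array)

# Initializing a Series from an existing Series.
# copy=True requests an independent copy of the data.
series_from_series = pd.Series(series_from_array, copy=True)
series_from_series['x'] = 0.0

# The original Series keeps its value
print(series_from_array['x'])
```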

Key Features of the Series data structure:

Indexing:

Each element in a Series has a corresponding index, which can be used to access or manipulate the data.

Python

print(series_from_list[0]) 
print(series_from_dict['b'])

Output:

1
2
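
Plain square brackets can be ambiguous when the index itself contains integers, so it may help to be explicit about whether a lookup is by label or by position. The sketch below uses `.loc` (label-based) and `.iloc` (position-based); the Series names and values are made up for illustration.

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = prices.loc['b']          # lookup by index label
by_position = prices.iloc[1]        # lookup by integer position
label_slice = prices.loc['a':'b']   # label slices include the end label
position_slice = prices.iloc[0:2]   # position slices exclude the end position

# With an integer index, labels and positions no longer coincide
numbered = pd.Series(['x', 'y'], index=[10, 20])
first_by_label = numbered.loc[10]
first_by_position = numbered.iloc[0]

print(by_label, by_position)
print(first_by_label, first_by_position)
```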

Vectorized Operations:

Series supports vectorized operations, allowing you to perform arithmetic operations on the entire series efficiently.

Python

series_a = pd.Series([1, 2, 3])
series_b = pd.Series([4, 5, 6])
sum_series = series_a + series_b 
print(sum_series)

Output:

0    5
1    7
2    9
dtype: int64
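
Vectorized operations are not limited to adding two Series. As a small sketch with invented values, a scalar is applied to every element, and a comparison produces a boolean Series that can be used as a filter.

```python
import pandas as pd

prices = pd.Series([100, 250, 400])

doubled = prices * 2                     # the scalar is applied to every element
is_expensive = prices > 200              # element-wise comparison, boolean Series
expensive_prices = prices[is_expensive]  # keep only elements where the mask is True

print(doubled)
print(expensive_prices)
```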

Alignment:

When performing operations between two Series objects, Pandas automatically aligns the data based on the index labels.

Python

series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
sum_series = series_a + series_b 
print(sum_series)

Output:

a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64
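
Two consequences of alignment are worth seeing side by side. In this sketch (values chosen for illustration), the order of the labels does not affect the result, and the `add` method with `fill_value=0` treats a label missing from one operand as 0 instead of producing NaN.

```python
import pandas as pd

# Same labels in a different order: values are matched by label, not position
left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([10, 20], index=['b', 'a'])
aligned_sum = left + right

# Partially overlapping labels: fill_value stands in for the missing side
series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
filled_sum = series_a.add(series_b, fill_value=0)

print(aligned_sum)
print(filled_sum)
```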

NaN Handling:

Missing values are represented by NaN (Not a Number). In arithmetic between two Series, a label that is missing from either operand produces NaN in the result instead of raising an error.

Python

series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5], index=['b', 'c'])
sum_series = series_a + series_b 
print(sum_series)

Output:

a    NaN
b    6.0
c    8.0
dtype: float64
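
Once a Series contains NaN, a few common methods help to detect or replace it. The following is a minimal sketch with made-up readings; note that reductions such as `sum` skip NaN by default.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])

missing_mask = readings.isna()             # True where a value is missing
total = readings.sum()                     # NaN is skipped by default
strict_total = readings.sum(skipna=False)  # NaN propagates when skipna=False
filled = readings.fillna(0)                # new Series with NaN replaced by 0
dropped = readings.dropna()                # new Series without the NaN entry

print(missing_mask)
print(total, strict_total)
print(filled)
print(dropped)
```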

What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional, tabular data structure with rows and columns. It is similar to a spreadsheet or a table in a relational database. A DataFrame has three main components: the data (the values themselves), the index (the row labels), and the columns (the column labels). Each column can hold a different data type.
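
The three components can be inspected directly. This is a small sketch on an invented two-row table; `df_parts` and its contents are illustrative only.

```python
import pandas as pd

df_parts = pd.DataFrame(
    {'Age': [25, 30], 'Score': [88.5, 92.0]},
    index=['john', 'alice'],
)

print(df_parts.index)       # row labels
print(df_parts.columns)     # column labels
print(df_parts.to_numpy())  # the underlying values as a NumPy array
print(df_parts.shape)       # (number of rows, number of columns)
```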

Creating a DataFrame from a dictionary and from a list of lists:

Python

import pandas as pd

# Initializing a DataFrame from a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)

# Initializing a DataFrame from a list of lists
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago']]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)

Output:

    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   35      Chicago
    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   35      Chicago

Key Features of the DataFrame data structure:

Indexing:

DataFrame provides flexible indexing options, allowing access to rows, columns, or individual elements based on labels or integer positions.

Python

# Accessing a column
print(df['Name'])

# Accessing a row by label
print(df.loc[0])

# Accessing a row by integer position
print(df.iloc[0])

# Accessing an individual element
print(df.at[0, 'Name'])

Output:

0     John
1    Alice
2      Bob
Name: Name, dtype: object
Name        John
Age           25
City    New York
Name: 0, dtype: object
Name        John
Age           25
City    New York
Name: 0, dtype: object
John
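
With the default index 0, 1, 2, `loc[0]` and `iloc[0]` return the same row, which can hide the difference between them. The sketch below uses hypothetical string row labels (`r1`, `r2`, `r3`) to separate label-based from position-based access.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 35]},
    index=['r1', 'r2', 'r3'],
)

name_and_age = people[['Name', 'Age']]        # list of columns -> DataFrame
rows_by_label = people.loc['r1':'r2', 'Age']  # label slice, end label included
rows_by_position = people.iloc[0:2, 1]        # position slice, end excluded
single_value = people.iat[2, 0]               # one value by row/column position

print(rows_by_label)
print(rows_by_position)
print(single_value)
```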

Column Operations:

Columns in a DataFrame are Series objects, enabling various operations such as arithmetic operations, filtering, and sorting.

Python

# Adding a new column
df['Salary'] = [50000, 60000, 70000]

# Filtering rows based on a condition
high_salary_employees = df[df['Salary'] > 60000]
print(high_salary_employees)

# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

Output:

  Name  Age     City  Salary
2  Bob   35  Chicago   70000
    Name  Age         City  Salary
2    Bob   35      Chicago   70000
1  Alice   30  Los Angeles   60000
0   John   25     New York   50000
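
The claim that columns are Series objects can be checked with a short sketch. The `staff` table and the `Bonus` column are invented for illustration; the point is that single brackets return a Series, double brackets return a DataFrame, and arithmetic on a column is element-wise.

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['John', 'Alice'], 'Salary': [50000, 60000]})

salary_column = staff['Salary']   # single brackets -> Series
salary_table = staff[['Salary']]  # double brackets -> one-column DataFrame
print(type(salary_column))
print(type(salary_table))

# Series arithmetic applies to the whole column at once
staff['Bonus'] = staff['Salary'] // 10
print(staff)
```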

Missing Data Handling:

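DataFrames provide methods for handling missing (NaN) values, such as dropna() to drop them and fillna() to fill them. Both return a new DataFrame rather than modifying the original, so the result must be printed or assigned. The df used here has no missing values, so both calls print it unchanged.

Python

# dropna() returns a new DataFrame without the rows that contain missing values
print(df.dropna())

# fillna(0) returns a new DataFrame with missing values replaced by 0
print(df.fillna(0))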

Output:

    Name  Age         City  Salary
0   John   25     New York   50000
1  Alice   30  Los Angeles   60000
2    Bob   35      Chicago   70000
    Name  Age         City  Salary
0   John   25     New York   50000
1  Alice   30  Los Angeles   60000
2    Bob   35      Chicago   70000
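
To see these methods change something, the data needs actual gaps. The sketch below builds a hypothetical `df_missing` with one missing age and one missing city, counts the gaps per column, and then drops or fills them. Passing a dictionary to `fillna` sets a different replacement for each column.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, np.nan, 35],
    'City': ['New York', 'Los Angeles', None],
})

# Count missing values in each column
missing_counts = df_missing.isna().sum()
print(missing_counts)

# Keep only rows with no missing values (returns a new DataFrame)
complete_rows = df_missing.dropna()
print(complete_rows)

# Fill each column with its own replacement value (returns a new DataFrame)
filled = df_missing.fillna({'Age': 0, 'City': 'Unknown'})
print(filled)
```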

Grouping and Aggregation:

DataFrames support group-by operations for summarizing data and applying aggregation functions.

Python

# Grouping by a column and calculating mean
avg_age_by_city = df.groupby('City')['Age'].mean()
print(avg_age_by_city)

Output:

City
Chicago        35.0
Los Angeles    30.0
New York       25.0
Name: Age, dtype: float64
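
In the example above each city appears only once, so every group mean equals the single value in that group. A sketch with repeated cities (invented data) shows the aggregation combining rows, using named aggregation to compute several summaries in one call.

```python
import pandas as pd

sales = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'New York', 'New York'],
    'Age': [30, 40, 20, 30],
    'Salary': [60000, 80000, 50000, 70000],
})

# One output column per (source column, aggregation function) pair
summary = sales.groupby('City').agg(
    avg_age=('Age', 'mean'),
    max_salary=('Salary', 'max'),
    people=('Age', 'count'),
)
print(summary)
```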