DataFrame vs Series in Pandas
Last Updated: 16 Sep, 2024
Pandas is a widely-used Python library for data analysis that provides two essential data structures: Series and DataFrame. These structures are potent tools for handling and examining data, but they have different features and applications.
In this article, we will explore the differences between Series and DataFrames.
What is Pandas?
Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use data structures, DataFrame and Series, which are designed to make working with structured data fast, easy, and expressive. Pandas is widely used in data science, machine learning, and data analysis for tasks such as data cleaning, transformation, and exploration.
What is a Pandas Series?
A Pandas Series is a one-dimensional, array-like object that can hold data of any type (integer, float, string, etc.). It is labelled: each element is associated with an index label, which defaults to the integer positions 0, 1, 2, and so on when no index is supplied. Labels are usually unique, although Pandas does not require this. You can think of a Series as a single column in a spreadsheet or database table. A Series can be created from a list, a NumPy array, a dictionary, or an existing Series. It is also the building block of the DataFrame, a two-dimensional table-like structure whose columns are Series objects.
Creating a Series data structure from a list, dictionary, and custom index:
Python
import pandas as pd
# Initializing a Series from a list
data = [1, 2, 3, 4, 5]
series_from_list = pd.Series(data)
print(series_from_list)
# Initializing a Series from a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
series_from_dict = pd.Series(data)
print(series_from_dict)
# Initializing a Series with custom index
data = [1, 2, 3, 4, 5]
index = ['a', 'b', 'c', 'd', 'e']
series_custom_index = pd.Series(data, index=index)
print(series_custom_index)
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
a 1
b 2
c 3
dtype: int64
a 1
b 2
c 3
d 4
e 5
dtype: int64
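The examples above cover lists and dictionaries. As a minimal sketch of the other two sources mentioned earlier (arrays and existing Series), the snippet below builds a Series from a NumPy array and then builds a second Series from the first. The variable names and values are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Initializing a Series from a NumPy array with a custom index
array_data = np.array([10, 20, 30])
series_from_array = pd.Series(array_data, index=['x', 'y', 'z'])
print(series_from_array)

# Initializing a Series from an existing Series.
# Passing an index here selects (reindexes to) the given labels.
series_subset = pd.Series(series_from_array, index=['x', 'z'])
print(series_subset)
```

When the data is already a Series, the index argument selects existing labels instead of relabelling the values by position.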
Key Features of Series data structure:
Indexing:
Each element in a Series has a corresponding index, which can be used to access or manipulate the data.
Python
print(series_from_list[0])
print(series_from_dict['b'])
Output:
1
2
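Square brackets can be ambiguous when the index labels are themselves integers, so it may help to be explicit. The sketch below, using an illustrative Series that is not part of the original example, shows label-based access with .loc and position-based access with .iloc.

```python
import pandas as pd

series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Label-based access
by_label = series.loc['b']
# Position-based access (0 is the first element)
by_position = series.iloc[0]

# Label slices include both endpoints; position slices exclude the stop
label_slice = series.loc['a':'b']
position_slice = series.iloc[0:2]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```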
Vectorized Operations:
Series supports vectorized operations, allowing you to perform arithmetic operations on the entire series efficiently.
Python
series_a = pd.Series([1, 2, 3])
series_b = pd.Series([4, 5, 6])
sum_series = series_a + series_b
print(sum_series)
Output:
0 5
1 7
2 9
dtype: int64
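Vectorized operations are not limited to adding two Series. As a small sketch with illustrative values, a scalar is applied to every element, and a comparison yields a boolean Series that can be used as a filter.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])

# A scalar is applied to every element
doubled = values * 2

# A comparison returns a boolean Series of the same length
is_large = values > 2

# The boolean Series can be used to filter the original values
large_values = values[is_large]

print(doubled)
print(is_large)
print(large_values)
```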
Alignment:
When performing operations between two Series objects, Pandas automatically aligns the data based on the index labels.
Python
series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
sum_series = series_a + series_b
print(sum_series)
Output:
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
NaN Handling:
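Missing values are represented by NaN (Not a Number). When a label exists in only one of the two Series, the result for that label is NaN rather than an error, and the result is converted to a floating-point dtype so that it can hold NaN.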
Python
series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5], index=['b', 'c'])
sum_series = series_a + series_b
print(sum_series)
Output:
a NaN
b 6.0
c 8.0
dtype: float64
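If NaN is not the desired result for unmatched labels, there are a few common options. The sketch below reuses series_a and series_b from the example above. It shows one possible approach for each case, not the only one.

```python
import pandas as pd

series_a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series_b = pd.Series([4, 5], index=['b', 'c'])

# Treat a label missing from one Series as 0 during the addition
sum_filled = series_a.add(series_b, fill_value=0)
print(sum_filled)

# Alternatively, add first and then deal with the NaN values
sum_series = series_a + series_b
nan_mask = sum_series.isna()        # True where the value is missing
replaced = sum_series.fillna(0)     # replace NaN with 0
dropped = sum_series.dropna()       # remove entries that are NaN
print(nan_mask)
print(replaced)
print(dropped)
```

Note that fill_value=0 keeps the original value for label 'a' (1.0), whereas filling after the addition yields 0.0 for that label.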
What is a Pandas DataFrame?
A Pandas DataFrame is a two-dimensional, tabular data structure with rows and columns, similar to a spreadsheet or a table in a relational database. It has three main components: the data (the values themselves), the index (the row labels), and the columns (the column labels).
Creating a DataFrame from a dictionary and from a list of lists:
Python
import pandas as pd
# Initializing a DataFrame from a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)
# Initializing a DataFrame from a list of lists
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago']]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
Name Age City
0 John 25 New York
1 Alice 30 Los Angeles
2 Bob 35 Chicago
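The three components described earlier (data, row index, and column labels) can be inspected directly. The following sketch rebuilds the same DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

print(df.index)       # row labels (a default integer range here)
print(df.columns)     # column labels
print(df.shape)       # (number of rows, number of columns)
print(df.to_numpy())  # the underlying values as a NumPy array
```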
Key Features of the DataFrame data structure:
Indexing:
DataFrame provides flexible indexing options, allowing access to rows, columns, or individual elements based on labels or integer positions.
Python
# Accessing a column
print(df['Name'])
# Accessing a row by label
print(df.loc[0])
# Accessing a row by integer position
print(df.iloc[0])
# Accessing an individual element
print(df.at[0, 'Name'])
Output:
0 John
1 Alice
2 Bob
Name: Name, dtype: object
Name John
Age 25
City New York
Name: 0, dtype: object
Name John
Age 25
City New York
Name: 0, dtype: object
John
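A few further selection patterns are sketched below on the same data: several columns at once, a single cell by row and column label, and lookups after using a column as the index. Using 'Name' as the index is an illustrative choice, not something the original example does.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Selecting several columns returns a DataFrame
name_and_city = df[['Name', 'City']]
print(name_and_city)

# Selecting one cell by row label and column label
city_of_second_row = df.loc[1, 'City']
print(city_of_second_row)

# Selecting the first two rows by integer position
first_two_rows = df.iloc[0:2]
print(first_two_rows)

# With 'Name' as the index, rows can be looked up by name
df_by_name = df.set_index('Name')
alice_age = df_by_name.loc['Alice', 'Age']
print(alice_age)
```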
Column Operations:
Columns in a DataFrame are Series objects, enabling various operations such as arithmetic operations, filtering, and sorting.
Python
# Adding a new column
df['Salary'] = [50000, 60000, 70000]
# Filtering rows based on a condition
high_salary_employees = df[df['Salary'] > 60000]
print(high_salary_employees)
# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
Output:
Name Age City Salary
2 Bob 35 Chicago 70000
Name Age City Salary
2 Bob 35 Chicago 70000
1 Alice 30 Los Angeles 60000
0 John 25 New York 50000
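Because each column is a Series, the Series features described earlier apply to columns directly. The sketch below checks the type of a column and derives a new column with a vectorized operation. The column name 'AgeInMonths' is a hypothetical name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35]})

# A single column is a Series
age_column = df['Age']
print(type(age_column))

# Vectorized arithmetic on a column produces a new Series,
# which can be stored as a new column ('AgeInMonths' is hypothetical)
df['AgeInMonths'] = df['Age'] * 12
print(df)

# A boolean Series built from a column filters the rows
thirty_or_older = df[df['Age'] >= 30]
print(thirty_or_older)
```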
Missing Data Handling:
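DataFrames provide methods for handling missing (NaN) values, including dropna() to drop them and fillna() to fill them. Both methods return a new DataFrame and leave the original unchanged, so the result must be assigned to a variable. The example DataFrame contains no missing values, so both results are identical to the original.
Python
# Dropping rows with missing values (returns a new DataFrame)
df_dropped = df.dropna()
print(df_dropped)
# Filling missing values with a specified value (returns a new DataFrame)
df_filled = df.fillna(0)
print(df_filled)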
Output:
Name Age City Salary
0 John 25 New York 50000
1 Alice 30 Los Angeles 60000
2 Bob 35 Chicago 70000
Name Age City Salary
0 John 25 New York 50000
1 Alice 30 Los Angeles 60000
2 Bob 35 Chicago 70000
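Since the DataFrame above has nothing missing, the effect of these methods is easier to see on data that does. The sketch below uses a small, made-up DataFrame with two missing entries. The fill values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data with a missing age and a missing city in the second row
df_missing = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                           'Age': [25, np.nan, 35],
                           'City': ['New York', None, 'Chicago']})

# Counting missing values per column
missing_counts = df_missing.isna().sum()
print(missing_counts)

# Dropping every row that contains at least one missing value
dropped = df_missing.dropna()
print(dropped)

# Filling missing values with a different value per column
filled = df_missing.fillna({'Age': 0, 'City': 'Unknown'})
print(filled)
```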
Grouping and Aggregation:
DataFrames support group-by operations for summarizing data and applying aggregation functions.
Python
# Grouping by a column and calculating mean
avg_age_by_city = df.groupby('City')['Age'].mean()
print(avg_age_by_city)
Output:
City
Chicago 35.0
Los Angeles 30.0
New York 25.0
Name: Age, dtype: float64
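In the example above each city appears only once, so every group contains a single row. As a sketch with made-up data where one city repeats, the snippet below computes several aggregations per group in one call using agg().

```python
import pandas as pd

# Made-up data in which 'Chicago' appears twice
people = pd.DataFrame({'City': ['Chicago', 'Chicago', 'New York'],
                       'Age': [30, 40, 25]})

# Several aggregations per group; the result is a DataFrame
# with one row per city and one column per aggregation
age_summary = people.groupby('City')['Age'].agg(['mean', 'max', 'count'])
print(age_summary)
```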
DataFrame vs Series
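Series | DataFrame |
---|---|
One-dimensional. | Two-dimensional. |
Holds a single dtype; mixed values are stored under the generic object dtype. | Each column can have its own dtype, so the table as a whole can be heterogeneous. |
Values are mutable; most operations that change the length return a new Series. | Values are mutable; columns and rows can be added or removed. |
Operations apply element-wise along one axis. | Operations apply element-wise and can also be run per column or per row. |
Methods work along a single axis. | Adds operations that need two axes, such as selecting columns or grouping rows by a column. |
Arithmetic aligns on index labels. | Arithmetic aligns on both row and column labels. |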
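The two structures convert into each other easily, which the short sketch below illustrates. The Series names and values are illustrative assumptions.

```python
import pandas as pd

names = pd.Series(['John', 'Alice'], name='Name')
ages = pd.Series([25, 30], name='Age')

# Combining Series side by side gives a DataFrame;
# each Series name becomes a column label
df = pd.concat([names, ages], axis=1)
print(names.ndim, df.ndim)   # number of dimensions of each structure
print(df.dtypes)             # each column keeps its own dtype

# A single Series can be promoted to a one-column DataFrame
names_frame = names.to_frame()

# Selecting one column or one row of a DataFrame gives back a Series
age_column = df['Age']
first_row = df.iloc[0]
print(type(age_column), type(first_row))
```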
Conclusion
In conclusion, Pandas offers two core data structures, Series and DataFrame, each suited to different data manipulation tasks. A Series handles one-dimensional labeled data with efficient indexing and vectorized operations, while a DataFrame organizes data in tabular form with versatile indexing, column operations, and methods for handling missing data. Understanding their differences is crucial for effective data analysis in Python.