How to utilise Pandas dataframe and series for data wrangling?
Last Updated :
21 Mar, 2024
In this article, we will see how to use the pandas DataFrame and Series for data wrangling.
Data wrangling is the process of cleaning and integrating messy, complex data sets so that they are easy to access and analyze. As the volume of data keeps growing, organizing it for analysis becomes more important. Data wrangling includes activities such as sorting, filtering, reducing, accessing, and processing data, and it is one of the most important tasks in data science and data analysis.
Utilize series for data wrangling
Creating Series
The pd.Series() constructor creates a pandas Series. Here a list supplies the values, and the index parameter supplies a label for each value. These labels let us look up values by name instead of by position.
Python3
# importing packages
import pandas as pd
# creating a series
population_data = pd.Series(
    [1440297825, 1382345085, 331341050, 274021604, 212821986],
    index=['China', 'India', 'United States', 'Indonesia', 'Brazil'])
print(population_data)
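As a small sketch of how the index labels help with retrieval, the example below rebuilds the same Series and reads values from it by label with .loc and by position with .iloc. The variable names india, first, and subset are chosen here for illustration only.

```python
import pandas as pd

population_data = pd.Series(
    [1440297825, 1382345085, 331341050, 274021604, 212821986],
    index=['China', 'India', 'United States', 'Indonesia', 'Brazil'])

# .loc selects by index label
india = population_data.loc['India']

# .iloc selects by integer position (0 is the first entry)
first = population_data.iloc[0]

# a list of labels returns a smaller Series
subset = population_data.loc[['China', 'Brazil']]

print(india)
print(first)
print(subset)
```

Using .loc and .iloc explicitly makes it clear whether a lookup is by label or by position.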
Filtering data: retrieving values based on conditions
From the Series created above, we retrieve data in two ways: by label, to get the population of India, and by a boolean condition, to get the countries with a population of more than a billion.
Python3
print('population of India is : ' + str(population_data['India']))
print('population greater than a billion :')
print(population_data[population_data > 1000000000])
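Conditions can also be combined. The sketch below, which rebuilds the same Series so it runs on its own, keeps the countries whose population lies between 250 million and one billion. The thresholds and the names mask and mid_sized are assumptions made for this example.

```python
import pandas as pd

population_data = pd.Series(
    [1440297825, 1382345085, 331341050, 274021604, 212821986],
    index=['China', 'India', 'United States', 'Indonesia', 'Brazil'])

# each comparison yields a boolean Series; & combines them element-wise
# (the parentheses are required because & binds tighter than > and <)
mask = (population_data > 250_000_000) & (population_data < 1_000_000_000)

# indexing with the boolean mask keeps only the entries marked True
mid_sized = population_data[mask]
print(mid_sized)
```

Use | for "or" and ~ for "not" in the same way; the Python keywords and, or, and not do not work element-wise on a Series.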
We can also create a Series from a dictionary by passing it to pd.Series(). The dictionary keys become the index labels.
Python3
population_data = pd.Series({'China': 1440297825,
                             'India': 1382345085,
                             'United States': 331341050,
                             'Indonesia': 274021604,
                             'Brazil': 212821986})
print(population_data)
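When a dictionary is combined with an explicit index, pandas keeps only the requested keys, in the requested order, and fills keys that are missing from the dictionary with NaN. The following sketch illustrates this; the label 'Japan' is deliberately absent from the dictionary and is used here only to show the NaN behaviour.

```python
import pandas as pd

data = {'China': 1440297825, 'India': 1382345085, 'Brazil': 212821986}

# only the labels listed in index are kept, in that order;
# 'Japan' is not a key of data, so its value becomes NaN
selected = pd.Series(data, index=['India', 'Brazil', 'Japan'])
print(selected)
```

Because NaN is a floating-point value, the resulting Series is stored as floats rather than integers.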
Changing the index of a Series
The index of a Series can be changed by assigning a new list of labels to its index attribute. The new list must have the same length as the Series.
Python3
population_data.index = ['CHINA', 'INDIA', 'US', 'INDONESIA', 'BRAZIL']
print(population_data)
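Assigning to the index replaces every label at once. As an alternative sketch, Series.rename() changes only the labels named in a mapping and returns a new Series, and the .str accessor on the index can transform all labels. The two-entry Series below is a shortened version of the data above, used only to keep the example brief.

```python
import pandas as pd

population_data = pd.Series({'China': 1440297825,
                             'United States': 331341050})

# rename() returns a new Series; labels not in the mapping are unchanged
renamed = population_data.rename(index={'United States': 'US'})

# upper-case every label on a copy, leaving renamed as it was
upper = renamed.copy()
upper.index = upper.index.str.upper()

print(renamed)
print(upper)
```

The original population_data keeps its labels, which is convenient when the old and new versions are both needed.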
Utilize Pandas Dataframe for data wrangling
In this example, we load a CSV file into a DataFrame and display its first n rows (5 by default) using the DataFrame.head() method. Series has the same method.
Python3
# importing packages
import pandas as pd
# loading csv data
population_data = pd.read_csv('employees.csv')
# setting the index of the dataframe
population_data=population_data.set_index('First Name')
# head of the dataframe
population_data.head()
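Since employees.csv is not reproduced in this article, the sketch below reads a few invented rows from an in-memory string instead. The column names follow the ones used in this article; the rows themselves are made up for illustration. It shows head() with an explicit row count and its counterpart tail().

```python
import io
import pandas as pd

# invented sample rows standing in for employees.csv
csv_text = """First Name,Gender,Start Date,Salary
Asha,Female,2015-03-01,72000
Ben,Male,2018-07-15,65000
Chen,Male,2012-11-30,98000
Dana,Female,2020-01-20,81000
Eli,Male,2016-05-09,58000
Fay,Female,2019-09-02,77000
"""

# read_csv accepts a file path or any file-like object
employees = pd.read_csv(io.StringIO(csv_text))
employees = employees.set_index('First Name')

top3 = employees.head(3)    # first three rows
last2 = employees.tail(2)   # last two rows
print(top3)
print(last2)
```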
Describing DataFrame
The DataFrame.describe() method returns summary statistics of the DataFrame. By default it summarizes only the numeric columns.
Python3
# importing packages
import pandas as pd
# loading csv data
population_data = pd.read_csv('employees.csv')
population_data.describe()
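To include non-numeric columns in the summary, describe() accepts include='all'. The sketch below uses a small invented DataFrame (not the real employees.csv) to compare the two calls.

```python
import pandas as pd

# invented sample data with one numeric and two text columns
df = pd.DataFrame({
    'First Name': ['Asha', 'Ben', 'Chen', 'Dana'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Salary': [72000, 65000, 98000, 81000],
})

# default: count, mean, std, min, quartiles, and max for numeric columns
numeric_summary = df.describe()

# include='all' adds unique, top, and freq rows for the text columns
full_summary = df.describe(include='all')

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of a text column) appear as NaN.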
Setting and Resetting the index of the Dataframe
DataFrame.set_index() sets one of the columns as the index of the DataFrame; the column name is given as an argument. DataFrame.reset_index() reverts the DataFrame to the default integer index and moves the old index back into a regular column.
Example 1: Setting the index of the DataFrame to the Start Date column
Python3
# importing packages
import pandas as pd
# loading csv data
population_data = pd.read_csv('employees.csv')
# setting the index of the dataframe
population_data = population_data.set_index('Start Date')
population_data
Example 2: Resetting the index and then setting it to the First Name column
Python3
# importing packages
import pandas as pd
# loading csv data
population_data = pd.read_csv('employees.csv')
# setting 'Start Date' as the index, as in Example 1
population_data = population_data.set_index('Start Date')
# resetting the index moves 'Start Date' back into a regular column
population_data = population_data.reset_index()
# setting 'First Name' as the new index
population_data = population_data.set_index('First Name')
population_data
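The round trip between set_index() and reset_index() can be sketched on a tiny invented DataFrame. The example also shows drop=True, which discards the old index instead of turning it back into a column.

```python
import pandas as pd

# invented sample data
df = pd.DataFrame({'First Name': ['Asha', 'Ben'],
                   'Salary': [72000, 65000]})

# 'First Name' becomes the index and is no longer a column
indexed = df.set_index('First Name')

# the old index returns as a column; rows are numbered 0, 1, ... again
restored = indexed.reset_index()

# drop=True throws the old index away instead of keeping it
dropped = indexed.reset_index(drop=True)

print(indexed)
print(restored)
print(dropped)
```

Both methods return a new DataFrame, which is why the results are assigned to variables here.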
Deleting a column from the DataFrames
The 'Salary' column is deleted from the DataFrame loaded from our CSV file. The del statement removes the column in place; the CSV file itself is not changed.
Python3
# importing packages
import pandas as pd
# loading csv data
population_data = pd.read_csv('employees.csv')
# deleting column
del population_data['Salary']
population_data.head()
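When the original DataFrame should stay intact, DataFrame.drop() is a possible alternative to del: it returns a new DataFrame without the named columns. A minimal sketch on invented data:

```python
import pandas as pd

# invented sample data
df = pd.DataFrame({'First Name': ['Asha', 'Ben'],
                   'Gender': ['Female', 'Male'],
                   'Salary': [72000, 65000]})

# drop() returns a copy without 'Salary'; df itself is unchanged
without_salary = df.drop(columns=['Salary'])

print(without_salary.columns.tolist())
print(df.columns.tolist())
```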
The DataFrame.transpose() method returns the transpose of the DataFrame, swapping its rows and columns.
Python3
# importing packages
import pandas as pd
# loading csv data
population_data = pd.read_csv('employees.csv')
# setting the index of the dataframe
population_data=population_data.set_index('First Name')
# displaying a transpose of the dataframe
population_data.transpose().head()
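The effect is easiest to see on a small example. In the sketch below, built on invented data, the row labels become column labels and the column names become row labels. The attribute .T is a shorthand for transpose().

```python
import pandas as pd

# invented sample data indexed by first name
df = pd.DataFrame({'Gender': ['Female', 'Male'],
                   'Salary': [72000, 65000]},
                  index=['Asha', 'Ben'])

# rows and columns swap places
transposed = df.T

print(transposed)
```

Because each transposed column now mixes text and numbers, pandas stores those columns with the generic object dtype.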
The DataFrame.sort_values() method sorts the rows of a DataFrame. The column name to sort by is passed as an argument, and ascending=False sorts from largest to smallest.
Python3
# importing packages
import pandas as pd
# loading csv data
population_data = pd.read_csv('employees.csv')
# setting the index of the dataframe
population_data=population_data.set_index('First Name')
# sorting the dataframe by the 'Salary' column, highest first
sorted_dataframe = population_data.sort_values('Salary', ascending=False)
sorted_dataframe
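sort_values() also accepts a list of columns, with a matching list for ascending. The following sketch, on invented data, sorts by gender alphabetically and then by salary from highest to lowest within each gender.

```python
import pandas as pd

# invented sample data
df = pd.DataFrame({
    'First Name': ['Asha', 'Ben', 'Chen', 'Dana'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Salary': [72000, 65000, 98000, 81000],
})

# first key: Gender ascending; second key: Salary descending
ranked = df.sort_values(['Gender', 'Salary'], ascending=[True, False])

print(ranked)
```

The result keeps the original row labels, so the index shows where each row came from.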
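Missing or null values can be counted with the DataFrame.isnull() method, which marks each missing entry as True; chaining .sum() gives the number of missing values per column.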
Python3
# importing packages
import pandas as pd
# loading csv data
data = pd.read_csv('employees.csv')
# checking for null values
data.isnull().sum()
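The counting step can be checked on a small invented DataFrame with known gaps, as in this sketch:

```python
import numpy as np
import pandas as pd

# invented sample data: one missing name and two missing salaries
df = pd.DataFrame({'First Name': ['Asha', None, 'Chen'],
                   'Salary': [72000, np.nan, np.nan]})

# isnull() gives True/False per cell; sum() counts the True values per column
missing_per_column = df.isnull().sum()

# summing again gives the total number of missing cells
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```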
We can remove rows that have null values by using the DataFrame.dropna() method. With axis=0 and how='any', a row is dropped if any of its values is missing.
Python3
# importing packages
import pandas as pd
# loading csv data
data = pd.read_csv('employees.csv')
# dropping NA values
data = data.dropna(axis=0, how='any')
# checking for null values
data.isnull().sum()
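Dropping every row with any gap can discard a lot of data. As a sketch of two gentler options on invented data, dropna(subset=...) only checks the listed columns, and fillna() replaces missing values instead of removing rows. The placeholder 'Unknown' is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd

# invented sample data: Ben has no gender, Chen has no salary
df = pd.DataFrame({'First Name': ['Asha', 'Ben', 'Chen'],
                   'Gender': ['Female', None, 'Male'],
                   'Salary': [72000, 65000, np.nan]})

# drops any row with at least one missing value
complete_rows = df.dropna()

# drops a row only if 'Salary' is missing
salary_required = df.dropna(subset=['Salary'])

# fills missing 'Gender' values and leaves other columns untouched
filled = df.fillna({'Gender': 'Unknown'})

print(complete_rows)
print(salary_required)
print(filled)
```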
In data analysis, grouping is a common requirement when a result must be expressed per group. pandas provides a built-in mechanism for splitting data into groups: the DataFrame.groupby() method.
In the code below, we load the employees data and use groupby() to group the rows by first name and then by gender. Calling first() on the result returns the first non-null value of each column in each group.
Python3
# importing pandas as pd
import pandas as pd
# loading csv data
df = pd.read_csv("employees.csv")
# grouping first by "First Name"
# and then, within each name, by "Gender"
data = df.groupby(['First Name', 'Gender'])
# first non-null value of each column in each group
data.first()
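Grouping is most often followed by an aggregation. The sketch below, on invented data, groups by a single column and computes per-group statistics on the salary column.

```python
import pandas as pd

# invented sample data
df = pd.DataFrame({
    'First Name': ['Asha', 'Ben', 'Chen', 'Dana'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Salary': [72000, 65000, 98000, 81000],
})

# average salary per gender
mean_salary = df.groupby('Gender')['Salary'].mean()

# several statistics at once: one row per group, one column per statistic
summary = df.groupby('Gender')['Salary'].agg(['count', 'max'])

print(mean_salary)
print(summary)
```

The group labels become the index of the result, so a single group's value can be read with, for example, mean_salary['Female'].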
The pandas merge() function (also available as the DataFrame.merge() method) merges two DataFrames. Supported join types include inner, outer, left, and right joins. When no key is given, the join uses all columns the two DataFrames have in common.
Python3
import pandas as pd
# reading the csv file
data = pd.read_csv('employees.csv')
# creating two dataframes
# top 5 rows
head_data = data.head()
# last 5 rows
tail_data = data.tail()
print("Head Data :")
print(head_data)
print("Tail Data :")
print(tail_data)
# merging the dataframes with an outer join
merge_data = pd.merge(head_data, tail_data, how='outer')
print("After merging: ")
print(merge_data)
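The join types are easier to compare when the two DataFrames share a key column but hold different information. In the sketch below, the tables salaries and genders are invented for illustration and are joined on 'First Name'. The indicator=True option adds a _merge column that records where each row came from.

```python
import pandas as pd

# two invented tables that share the key column 'First Name'
salaries = pd.DataFrame({'First Name': ['Asha', 'Ben', 'Chen'],
                         'Salary': [72000, 65000, 98000]})
genders = pd.DataFrame({'First Name': ['Ben', 'Chen', 'Dana'],
                        'Gender': ['Male', 'Male', 'Female']})

# inner join: only names present in both tables
inner = pd.merge(salaries, genders, on='First Name', how='inner')

# left join: every row of salaries, with NaN where genders has no match
left_join = pd.merge(salaries, genders, on='First Name', how='left')

# outer join: names from either table; _merge shows the source of each row
outer = pd.merge(salaries, genders, on='First Name', how='outer',
                 indicator=True)

print(inner)
print(left_join)
print(outer)
```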
The pd.concat() function concatenates pandas objects along an axis (rows by default). Let's load two DataFrames and concatenate them.
Python3
import pandas as pd
# reading two csv files
data1 = pd.read_csv('employees.csv')
data2 = pd.read_csv('borrower.csv')
# concatenating the dataframes row-wise
pd.concat([data1, data2])
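When the two DataFrames do not have the same columns, concat() keeps the union of the columns and fills the gaps with NaN. The sketch below uses two small invented DataFrames to show this, along with ignore_index=True, which renumbers the rows of the result.

```python
import pandas as pd

# two invented tables with one shared and one distinct column each
a = pd.DataFrame({'First Name': ['Asha', 'Ben'],
                  'Salary': [72000, 65000]})
b = pd.DataFrame({'First Name': ['Chen'],
                  'Gender': ['Male']})

# the original row labels are kept, so label 0 appears twice
stacked = pd.concat([a, b])

# ignore_index=True replaces the labels with 0, 1, 2, ...
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked)
print(renumbered)
```

Duplicate row labels are legal in pandas but make label-based lookups ambiguous, so renumbering is often the safer choice after stacking rows.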