Exp1 - Manipulating Datasets Using Pandas
Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data.
Pandas allows us to analyze big data and draw conclusions based on statistical
theories. Pandas can clean up messy data sets and make them readable and relevant.
Relevant data is very important in data science.
Pandas Installation: If you have Python and PIP already installed on your system, then
installing Pandas is very easy. Install it using this command:
pip install pandas
Import Pandas: Once Pandas is installed, import it into your applications by adding
the import keyword:
import pandas
Pandas as pd: Pandas is usually imported under the pd alias (in Python, an alias is an
alternate name for referring to the same thing). Create an alias with the as keyword
while importing:
import pandas as pd
The primary two components of Pandas are the Series and DataFrame. A Series is
essentially a column, and a DataFrame is a multi-dimensional table made up of a
collection of Series. DataFrames and Series are quite similar in that there are many
operations that you can do with one that you can do with the other, such as filling in
null values and calculating the mean.
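As a quick illustration of this symmetry, the same operations work on both structures. This is a minimal sketch with made-up numbers:

```python
import pandas as pd

# a Series (one column) and a DataFrame (a table of Series), with sample data
s = pd.Series([10, 20, 30])
df = pd.DataFrame({"a": [10, 20, 30], "b": [1.0, None, 3.0]})

print(s.mean())        # mean of the Series
print(df.mean())       # mean of every column of the DataFrame
print(df.fillna(0))    # filling in null values works on both, too
```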
Pandas Series can be created from lists, dictionaries, scalar values, etc. In the real
world, a Pandas Series will be created by loading the datasets from existing storage,
which can be a SQL Database, a CSV file, or an Excel file. Here are some ways in which
we create a series:
Example 1: Create a simple Pandas Series from a list:
In order to create a series from a list, we have to first create a list. After that, we can
create a series from a list.
import pandas as pd
# first create a list, then build a Series from it
my_list = [7, 2, 0]
my_Series = pd.Series(my_list)
print(my_Series)
Regarding labels: By default, the values in the series are labeled with their index
number. The first value has index 0, second value has index 1, etc. This label can be
used to access a specified value.
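For instance, the default integer label can be used like this (the values here are an assumed sample):

```python
import pandas as pd

my_Series = pd.Series([420, 380, 390])
# access the first value by its default label 0
print(my_Series[0])
```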
In order to create a series from a dictionary, we have to first create the dictionary. After
that, we can make a series from it. Dictionary keys are used to construct the index
labels of the series.
import pandas as pd
# a simple dictionary
calories = {"day1": 420, "day2": 380, "day3": 390}
# the dictionary keys become the index labels
my_Series = pd.Series(calories)
print(my_Series)
Pandas DataFrame can be created in multiple ways: from lists, lists of lists, dicts
of lists, lists of dicts, Series, files, etc. Note that the DataFrame() function
is used to create a DataFrame in Pandas.
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion
in rows and columns. A Pandas DataFrame consists of three principal components: the data,
the rows, and the columns. Pandas DataFrame can be created from lists, dictionaries, lists of
dictionaries, etc. In the real world, a Pandas DataFrame will be created by loading the
datasets from existing storage, which can be a SQL database, a CSV file, or an Excel file. To
create a DataFrame, the .DataFrame() method is used. The syntax of the
.DataFrame() method is as follows:
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None,
copy=None)
import pandas as pd
# a sample dictionary of lists (assumed data; the original was omitted)
d = {'Name': ['Tom', 'Jack'], 'Age': [28, 34]}
# creates Dataframe.
df = pd.DataFrame(d)
print(df)
import pandas as pd
# a sample list of lists (assumed data; the original was omitted)
data = [['Tom', 28], ['Jack', 34]]
# Create DataFrame with explicit column labels
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
Dataframes can also be created from Comma-Separated Values (CSV) files. Therefore, if your
data sets are stored in a file, Pandas can load them into a DataFrame. The first step in
creating a DataFrame from a CSV file is to read the file into Python. This can be done
using the `pandas` library, which provides a simple way to read CSV files as
DataFrames.
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
It's worth noting that the `read_csv` function has many optional parameters that can
be used to customize how the CSV file is read. For example, you can specify the
delimiter used in the file (in case it's not a comma), the encoding, and whether or not
the file contains a header row.
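As a sketch of those parameters, the snippet below first writes a small tab-separated file and then reads it back (the file name and its contents are assumptions, not from the original):

```python
import pandas as pd

# create a small tab-separated file to read back (sample data)
with open("data_tab.txt", "w", encoding="utf-8") as f:
    f.write("calories\tduration\n420\t50\n380\t40\n")

# sep sets the delimiter, encoding sets the character encoding,
# and header=0 says the first row holds the column names
df = pd.read_csv("data_tab.txt", sep="\t", encoding="utf-8", header=0)
print(df)
```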
Viewing your data: The first thing to do when opening a new dataset is to print out a few
rows to keep as a visual reference. We accomplish this with the .head() function.
Example 7: Using the .head() function to print the first five rows of the data set
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
The .head() function will output the first five rows of your DataFrame by default, but we
can also pass a number; for example, df.head(10) would output the top ten rows. To see the
last five rows, use .tail(). The .tail() function also accepts a number, and in this case
we are printing the bottom rows:
Example 8: using .tail() function to print the last five rows of the data set
import pandas as pd
df = pd.read_csv('data.csv')
print(df.tail())
DataFrame Shape: The shape of a DataFrame is a tuple of array dimensions that tells
the number of rows and columns of a given DataFrame. The DataFrame.shape
attribute in Pandas enables us to obtain the shape of a DataFrame. For example, if a
DataFrame has a shape of (80, 10) , this implies that the DataFrame is made up of 80
rows and 10 columns of data.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape)
Information About the Data: The DataFrame object has a method called info() that
gives you more information about the data set. The .info() method provides the
essential details about your dataset, such as the number of rows and columns, the
number of non-null values, what type of data is in each column, and how much memory
your DataFrame is using.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.info())
When you create a DataFrame in Pandas, the DataFrame will automatically have
certain properties. Specifically, each row and each column will have an integer
“location” in the DataFrame. These integer locations for the rows and columns start at
zero. So the first column will have an integer location of 0, the second column will have
an integer location of 1, and so on. The same numbering pattern applies to the rows.
We can use these numeric indexes to retrieve specific rows and columns.
To select a subset from a DataFrame, we use the indexing operator [], attribute
operator ., and an appropriate Pandas DataFrame indexing method such as loc, iloc,
at, iat, and some others. Essentially, there are two main ways of indexing Pandas
DataFrames: label-based and position-based (aka location-based or integer-based).
Also, it is possible to apply boolean DataFrame indexing based on predefined
conditions, or even mix different types of DataFrame indexing.
For the next examples, we will create the following DataFrame to explore the effects of
various dataframe indexing methods:
import pandas as pd
import numpy as np
# a 10x10 DataFrame of random integers (assumed sample data; the original was omitted)
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 20, size=(10, 10)),
                  columns=['col_1', 'col_2', 'col_3', 'col_4', 'col_5',
                           'col_6', 'col_7', 'col_8', 'col_9', 'col_10'])
print(df)
● Using the indexing operator: If we need to select all data from one or multiple
columns of a Pandas DataFrame, we can simply use the indexing operator []. To
select all data from a single column, we pass the name of this column. Before
proceeding, let's take a look at what the current labels of our DataFrame are. For
this purpose, we will use the attributes columns and index:
print(df.columns)
print(df.index)
Example 11: printing the second column using the label indexing operator.
print(df['col_2'])
● Using the loc indexer: If we need to select not only columns but also rows (or only
rows) from a dataframe, we can use the loc method, aka loc indexer. This method
implies using the indexing operator [] as well. This is the most common way of
accessing dataframe rows and columns by label. In general, the syntax
df.loc[row_label] is used to pull a specific row from a DataFrame as a Pandas Series
object.
Example 12: selecting the values of the first row of the DataFrame using the loc
indexer
print(df.loc[0])
However, for our further experiments with the loc indexer, let's rename the row labels to
something more meaningful and of a string data type:
df.index = ['row_1', 'row_2', 'row_3', 'row_4', 'row_5', 'row_6',
'row_7', 'row_8', 'row_9', 'row_10']
Now try to run the following samples and figure out what each returns:
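A few representative loc calls over the relabeled DataFrame look like this (the particular selections, and the 10x10 layout with values 0-19, are illustrative assumptions):

```python
import pandas as pd
import numpy as np

# rebuild the sample DataFrame with string row labels (assumed data)
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 20, size=(10, 10)),
                  columns=[f"col_{i}" for i in range(1, 11)])
df.index = [f"row_{i}" for i in range(1, 11)]

print(df.loc["row_3"])                                 # one row as a Series
print(df.loc["row_3", "col_2"])                        # a single value
print(df.loc["row_2":"row_4", "col_1":"col_3"])        # label slices: both ends inclusive
print(df.loc[["row_1", "row_5"], ["col_2", "col_7"]])  # explicit lists of labels
```

Note that, unlike Python slicing, label-based slicing with loc includes both the start and the stop label.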
Task 1.1: Print the first 4 rows using the slicing method.
Task 1.2: Print the last row with columns from col_5 to col_7.
● Using the at indexer: For the last case from the previous section, i.e., for selecting
only one value from a DataFrame, there is a faster method – using the at indexer.
The syntax is identical to that of the loc indexer, except that here we always use
exactly two labels (for the row and column) separated by a comma:
df.at['row_6', 'col_3']
● Position-based indexing: Using this approach, each DataFrame element (row, column, or data point) is referred
to by its position number rather than its label. The position numbers are integers
starting from 0 for the first row or column (typical Python 0-based indexing) and
increasing by 1 for each subsequent row/column. Position-based indexing is purely
Python-style, i.e., the start bound of the range is inclusive while the stop bound is
exclusive. Position-based indexing uses indexing operator [], iloc, and iat methods.
Try to run the following samples and figure out what each returns:
print(df[3:6])
print(df.iloc[3])
print(df.iloc[[9, 8, 7]])
print(df.iloc[0, [2, 8, 3]])
print(df.iat[1, 2])
Task 1.3: Use iloc with position-based indexing to select rows 8 and 1 and columns 5 to
9.
Task 1.4: Use iloc with position-based indexing to select all rows and columns 1, 3, and
7.
● Boolean indexing: Rows can be selected by applying a condition to the values of a
column (the comparison operators == and != make sense in such cases). In addition,
we can define several criteria for the same column or multiple columns. The operators
to be used for this purpose are & (and), | (or), ~ (not). Each condition must be put in a
separate pair of brackets.
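A sketch of combining conditions with these operators, using sample data of the same 10-column layout as above (the data is made up):

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 20, size=(10, 10)),
                  columns=[f"col_{i}" for i in range(1, 11)])

# each condition goes in its own pair of parentheses
print(df[df["col_1"] > 10])                        # single condition
print(df[(df["col_1"] > 10) & (df["col_2"] < 5)])  # and
print(df[(df["col_1"] > 10) | (df["col_2"] < 5)])  # or
print(df[~(df["col_1"] > 10)])                     # not
```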
Task 1.5: Selecting all the rows of the dataframe where the value of col_2 is NOT
greater than 15
Task 1.6: Selecting all the rows of the dataframe where the value of col_2 is greater
than 15 but not equal to 19
To export a DataFrame to a CSV file, we use the DataFrame.to_csv() method, which
accepts parameters such as sep, index, and encoding. Here, DataFrame refers to the
Pandas DataFrame that we want to export, and filename refers to the name of the file
that you want to save your data to.
The sep parameter specifies the separator that should be used to separate values in
the CSV file. By default, it is set to comma-separated values. We can also set it to a
different separator, like \t for tab-separated values.
The index parameter is a boolean value that determines whether to include the index
of the DataFrame in the CSV file. By default, it is set to True, which means the index
is included; pass index=False to leave it out.
The encoding parameter specifies the character encoding to be used for the CSV file.
By default, it is set to utf-8, which is a standard encoding for text files.
Example: Saving DataFrame to CSV
import pandas as pd
# sample data (assumed; the original dictionary was omitted)
Biodata = {'Name': ['Amit', 'Sara'], 'Age': [21, 23]}
df = pd.DataFrame(Biodata)
# write the DataFrame to disk without the index column
df.to_csv('biodata.csv', index=False)
2.6 Dealing with Rows and Columns in Pandas DataFrame
We can perform basic operations on rows/columns like deleting, adding, and renaming.
import pandas as pd
# sample DataFrame (assumed data; the original example was omitted)
df = pd.DataFrame({'Name': ['Amit', 'Sara', 'John', 'Lina'],
                   'Qualification': ['Msc', 'MA', 'Msc', 'Msc']})
# basic method: assigning a list adds the new column at the end
df['Height'] = [5.1, 6.2, 5.1, 5.2]
print(df)
Note that the basic method will add the column at the end. By using DataFrame.insert()
method, it gives us the freedom to add a column at any position we like, not just at the
end. It also provides different options for inserting the column values.
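A short sketch of DataFrame.insert(): the first argument is the target position, then the column name and its values (the name and values here are made up):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Sara", "John", "Lina"],
                   "Qualification": ["Msc", "MA", "Msc", "Msc"]})

# insert 'Score' as the second column (position 1)
df.insert(1, "Score", [88, 92, 79, 85])
print(df)
```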
Task 1.8: Add a new column (on the dataframe above) named Age with the following
values [21, 23, 24, 21] and make it the third column.
In Order to delete a column in Pandas DataFrame, we can use the drop() method.
Columns are deleted by dropping columns with column names.
import pandas as pd
# sample data (the 'Name' column and the surrounding code are assumed;
# only part of the original example survived)
data = {'Name': ['Amit', 'Sara', 'John', 'Lina'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Age': [21, 23, 24, 21]}
df = pd.DataFrame(data)
# drop two columns by name, in the original DataFrame itself
df.drop(['Height', 'Qualification'], axis=1, inplace=True)
print(df)
As you can see from the output, the new output doesn’t have the passed columns.
Those values were dropped since the axis was set to 1, and the changes were made in
the original data frame since inplace was True. Note that you use axis=1 (or the
columns parameter) to remove columns. By default, Pandas returns a copy of the
DataFrame after deleting columns; use the inplace=True parameter to remove them
from the existing referring DataFrame.
Task 1.9: Remove column Height and column Age from the existing referring
dataframe.
There are multiple ways to add or insert a row into a Pandas DataFrame. In this section,
we will explain how to add a row to a Pandas DataFrame using the loc[] method. By
using df.loc[index] = list you can append a list as a row to the DataFrame at a specified
index. In order to add at the end, get the index of the next row using len(df). The
below example adds the list ["Hyperion", 27000, "60days", 2000] to the end
of the Pandas DataFrame.
Example 16: Adding a new row to the end of the Pandas DataFrame.
import pandas as pd
technologies= {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','35days', '40days','55days'],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
new_row = ["Hyperion", 27000, "60days", 2000]
df.loc[len(df)] = new_row
print(df)
Task 1.10: use .iloc to insert the new row above into the second position of the
DataFrame.
import pandas as pd
import numpy as np
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python"],
'Fee' :[20000,25000,26000,22000],
'Duration':['30day','40days',np.nan, None],
'Discount':[1000,2300,1500,1200]
}
indexes=['r1','r2','r3','r4']
df = pd.DataFrame(technologies,index=indexes)
print(df)
If you have a DataFrame with row labels (index labels), you can specify which rows you
want to remove by label names. Here is an example:
df1 = df.drop(['r1', 'r2'])
print(df1)
Alternatively, you can write the same statement by using the parameter name 'index'.
df1 = df.drop(index=['r1','r2'])
print(df1)
Similarly, by using the drop() method, you can also remove rows by index position from
a Pandas DataFrame. The drop() method doesn’t have a position index as a parameter;
hence, we need to get the row labels from the index and pass these to the drop
method. We will use df.index to get row labels for the indexes we want to delete. Below
are some examples:
# Delete Rows 1 and 3 by Index numbers
df1=df.drop(df.index[[1,3]])
print(df1)
Most of the time, we would also need to remove DataFrame rows based on some
condition on a column value. Below are some quick examples applied to the
technologies dataframe.
# Keep only rows with Fee >= 24000 (i.e., remove the rest)
df2 = df[df.Fee >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000]
Task 1.11: delete all rows with Fee >= 2200 and Discount == 2300. Save the result in a
file named task.csv
Task 1.12: Write a code to delete all rows with a None/NaN value.
Task 1.13: Write a code to delete all rows having courses equal to Spark or Hadoop.
2.7 Combining DataFrames in Pandas
With concatenation, your datasets are just stitched together along an axis — either the
row axis or the column axis. To concatenate two or more DataFrames vertically, you can
use the .concat() method with its default parameters:
import pandas as pd
# two sample DataFrames (assumed data; the original was omitted)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# stitch the frames together along the row axis (default axis=0)
result = pd.concat([df1, df2])
print(result)
Task 1.14: Perform the concatenation in the above example along columns.
The following summarizes the merging types along with details on how to customize
the .merge() method in each case.

Inner Join: Use keys present in both frames. To perform an inner join between two
DataFrames using a single column, all we need is to provide the on argument when
calling merge():
df = pd.merge(df1, df2, on='id')

Left Outer Join: Use keys from the left frame only. To perform a left join between
two pandas DataFrames, you now have to specify how='left' when calling merge():
df = df1.merge(df2, on='id', how='left')

Right Outer Join: Use keys from the right frame only. To perform a right join
between two pandas DataFrames, you now have to specify how='right' when calling
merge():
df = pd.merge(df1, df2, how='right')

Full Outer Join: Use the union of keys from both frames. To perform a full outer
join, you need to specify how='outer' when calling merge():
df = pd.merge(df1, df2, how='outer')

Left Anti-Join: Use only keys from the left frame that don’t appear in the right
frame. A left anti-join in pandas can be performed in two steps. In the first step,
we need to perform a left outer join with indicator=True. Then, we simply need to
query() the result of the previous expression:
df3 = df1.merge(df2, on='A', how='left', indicator=True)
df = df3.loc[df3['_merge'] == 'left_only', 'A']
d = df1[df1['A'].isin(df)]

Right Anti-Join: Use only keys from the right frame that don’t appear in the left
frame. Similar to the left anti-join:
df3 = df1.merge(df2, on='A', how='right', indicator=True)
df = df3.loc[df3['_merge'] == 'right_only', 'A']
d = df3[~df3['A'].isin(df)]
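To experiment with the join snippets above, two small frames sharing an 'id' key can serve as input (the frames and their values are sample data, not from the original):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "colA": ["x", "y", "z"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "colE": [20, 30, 40]})

print(pd.merge(df1, df2, on="id"))               # inner join: only ids 2 and 3
print(df1.merge(df2, on="id", how="left"))       # left outer join: all ids from df1
print(pd.merge(df1, df2, on="id", how="outer"))  # full outer join: union of ids
```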
Task 1.15: Run the samples above and figure out what each returns.
Task 1.16: Let’s say that we want to merge frames df1 and df2 using a left outer join.
Select all the columns from df1 but only column colE from df2.