Mastering Pandas - Important Pandas Functions For Your Next Project

The document discusses important pandas functions for data science projects. It describes functions for string access like .str.len() and .str.contains(), datetime access like .dt.day and .dt.month, plotting with .plot(), dummy encoding with pd.get_dummies(), querying data with df.query(), and selecting data types with df.select_dtypes. These pandas functions help clean, preprocess and analyze data more efficiently.

Uploaded by

dchandra15

Mastering Pandas: Important Pandas Functions For Your Next Project

The pandas library has long been a favorite of data scientists and
analysts because it is easy to use, offers a wide range of
functionality, and makes results easy to interpret. Anyone starting a
data science journey is advised to gain a good command of pandas and to
build pipelines that reduce the manual effort of cleaning and
preprocessing data.
Pandas is built on top of NumPy, which allows faster execution of
commands and gets the work done in less time. In this article, we share
some underrated pandas functions that can improve your project's code
quality.
Before moving ahead, here is a quick legend:
• All the commands mentioned assume that the data frame is named 'df',
which is an instance of pd.DataFrame.
• The pandas library has been imported under the alias 'pd'.
String Accessors
String or text data makes up a major part of many datasets. Whether it
is information about the author, title, or publication of a book, or
tweets made under a particular hashtag, we have a lot of text data, and
this data comes in handy when it is cleaned properly and fed to a
classifier such as Naive Bayes. Here are some tricks you can apply:
• To access string data, use the 'str' accessor. For example,
df['column_name'].str
• This makes it possible to apply string operations to every value in
the selected column.
• Some common operations include:
o df['column_name'].str.len(): Length of each string.
o .str.split(): Splits each string at a particular character.
o .str.contains(): Returns True/False for each string depending on
whether the given word or pattern is present in it.
o .str.count(): Returns, for each string, the number of times the
regular expression passed occurs in it.
o .str.findall(): Returns, for each string, a list of all the matches
of the expression passed.
o .str.replace(): Finds matches like findall, but replaces each matched
item with the given string (pass regex=True to treat the pattern as a
regular expression).
o Most Python string methods, such as .str.title(), .str.isalpha(),
.str.isalnum(), and .str.isdecimal(), are also supported.
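The string operations listed above can be sketched on a small example. The data frame and its 'title' column are hypothetical and exist only for illustration; this is a minimal sketch, not a complete cleaning routine.

```python
import pandas as pd

# Hypothetical sample data; the column name 'title' is illustrative only.
df = pd.DataFrame(
    {"title": ["data science 101", "pandas in action", "learn sql"]}
)

lengths = df["title"].str.len()                   # characters in each string
words = df["title"].str.split(" ")                # list of tokens per row
has_pandas = df["title"].str.contains("pandas")   # boolean mask per row
a_count = df["title"].str.count("a")              # occurrences of 'a' per string
digits = df["title"].str.findall(r"\d+")          # list of digit runs per row
renamed = df["title"].str.replace("sql", "SQL", regex=False)  # literal replace
titled = df["title"].str.title()                  # capitalize every word

print(pd.DataFrame({"len": lengths, "has_pandas": has_pandas, "a_count": a_count}))
```

Each call returns a new Series aligned with the original index, so a result such as has_pandas can be used directly as a mask: df[has_pandas].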

Datetime Accessors
Dates and times are commonly present in datasets in the form of
timestamps, start times, end times, or any other timing associated with
an event. It is useful to parse this data properly because it reveals
trends along a timeline that can be used to predict future events, which
is known as time-series analysis. Let's see some useful commands:
• To access DateTime data, first convert the column (date values are
often parsed as strings, i.e. the object type) to DateTime using the
pd.to_datetime() function.
• Now, using the '.dt' accessor, we can access any DateTime information
required, such as:
o df['column_name'].dt.day: Returns the day of the month.
o .dt.time: Time component of the timestamp.
o .dt.year: Year of the date.
o .dt.month: Month of the date.
o .dt.weekday: Day of the week in numerical form, where 0 represents
Monday and 6 represents Sunday. If you want day names, use
.dt.day_name().
o .dt.is_month_start: Returns True/False depending on whether the date
is the first day of the month.
o .dt.is_month_end: Same as is_month_start, but checks for the last day
of the month.
o .dt.quarter: Returns the quarter (1 to 4) in which the date lies.
o .dt.is_quarter_start: Returns True/False depending on whether the date
is the first day of a quarter.
o .dt.is_quarter_end: Returns True/False depending on whether the date
is the last day of a quarter.
o .dt.normalize(): When the time component does not add value to the
analysis, it can be ignored. This method sets the time of every value to
midnight, i.e., 00:00:00, while keeping the date.
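The datetime workflow above can be sketched as follows: convert the column with pd.to_datetime() first, then read components through '.dt'. The 'start' column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical timestamps stored as strings, as they often are after reading a file.
df = pd.DataFrame(
    {"start": ["2023-01-01 09:30:00", "2023-03-31 18:45:00", "2023-07-15 00:00:00"]}
)

# Convert the string column to a datetime column before using the .dt accessor.
df["start"] = pd.to_datetime(df["start"])

day = df["start"].dt.day                      # day of the month
weekday = df["start"].dt.weekday              # 0 = Monday ... 6 = Sunday
day_name = df["start"].dt.day_name()          # method call, note the parentheses
quarter = df["start"].dt.quarter              # 1 to 4
month_start = df["start"].dt.is_month_start   # True on the first day of a month
quarter_end = df["start"].dt.is_quarter_end   # True on the last day of a quarter
normalized = df["start"].dt.normalize()       # same dates, time set to 00:00:00

print(pd.DataFrame({"weekday": weekday, "day_name": day_name, "quarter": quarter}))
```

Note that day, weekday, quarter and the is_* checks are attributes, while day_name() and normalize() are methods and need parentheses.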
Pandas Plotting
Plotting visualizations is one of the key components of data analysis
and plays a major role in feature engineering. For example, outliers in
a dataset can be detected using box plots, which show the median and
interquartile range and draw outliers as individual points beyond the
whiskers.
Plotting is mostly done with other libraries such as seaborn, plotly,
bokeh, or matplotlib, but what if you want to visualize data instantly
without explicitly calling those libraries? Pandas has a solution. Using
the .plot() method of a data frame or series, you can plot graphs
directly; pandas calls matplotlib internally, so matplotlib must still
be installed. Various options are available:
• df.plot() or df['column_name'].plot() (depending upon the type of
graph).
• df.plot() has a parameter 'kind' which defines the graph. By default,
it is a 'line' plot, but other options available are 'bar', 'barh',
'box', 'hist', 'kde', etc.
• It uses matplotlib as the backend and returns a matplotlib Axes
object, so the plot can be customized further with matplotlib methods.
An existing Axes can also be passed in through the 'ax' parameter.
• The .plot() function can also take arguments such as 'title',
'xticks', 'xlim', 'xlabel', 'fontsize', and 'colormap', which reduces
the need to call external libraries to some extent.
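A minimal sketch of the plotting options above, assuming matplotlib is installed. The 'sales' and 'returns' columns, the output file name, and the choice of the non-interactive "Agg" backend (so the script runs without a display) are assumptions made for this example.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; 40 is an obvious outlier in 'sales'.
df = pd.DataFrame({"sales": [10, 12, 9, 40, 11], "returns": [1, 2, 1, 3, 2]})

# Create two Axes up front and hand each one to pandas via the 'ax' parameter.
fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Data frame plot: one line per column; 'kind' defaults to 'line'.
line_ax = df.plot(kind="line", ax=left, title="Sales and returns")

# Series plot: a box plot of a single column.
box_ax = df["sales"].plot(kind="box", ax=right, title="Sales distribution")

# .plot() returns a matplotlib Axes, so matplotlib methods work on the result.
line_ax.set_ylabel("units")

fig.savefig("pandas_plot_sketch.png")  # hypothetical output file name
```

Passing 'ax' explicitly keeps each plot on its own Axes; without it, pandas decides where to draw.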

Miscellaneous Functions
• pd.get_dummies(): While preprocessing data, we sometimes encounter
categorical data that needs to be converted into numerical form before
it is fed to the model. When the number of categories is fairly low,
one-hot encoding is preferred, but doing this manually takes a long
time. When applied to a data frame, this function creates one indicator
column per category and replaces the original categorical column with
them. If drop_first is set to True, it also drops the indicator column
of the first category, leaving k-1 columns for k categories.
• df.query(): This function filters the data frame with a condition
written as a string, for example df.query('column_name > 30'). The basic
difference between this and normal masking is that it returns the
matching rows directly instead of a boolean mask, which removes the
effort of creating the mask and applying it to the data frame.
• df.select_dtypes(): Sometimes we need to perform a specific task on
columns of one data type. For example, while reading data from external
files, some columns are read as the generic object type. After
cleaning, the dataset must have the correct data types, and converting
columns one by one with df['column_name'].astype('data-type') would be
tedious when the number of such columns is large. This function selects
all columns of the specified data type, and it can be combined with the
.apply() function. A sample code would look like this:
df.select_dtypes(object).apply(lambda col: col.astype(str))
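The three functions above can be tried on a small data frame. The column names and values are hypothetical; the 'zip' column deliberately mixes integers and strings so that pandas stores it as the object type.

```python
import pandas as pd

# Hypothetical data; 'zip' mixes ints and strings, so its dtype is object.
df = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Pune", "Mumbai"],
        "age": [25, 32, 47, 19],
        "zip": [1001, "1002", 1003, "1004"],
    }
)

# One-hot encode 'city'. The original column is replaced by indicator columns;
# drop_first=True drops the first category ('Delhi'), leaving k-1 columns.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True)

# query() returns the matching rows directly ...
adults = df.query("age > 30")
# ... which is equivalent to building and applying a boolean mask by hand.
adults_mask = df[df["age"] > 30]

# Select columns by data type.
numeric = df.select_dtypes(include="number")

# Select the object columns and convert every value in them to a string.
converted = df.select_dtypes(include="object").apply(lambda col: col.astype(str))

print(encoded.columns.tolist())
print(adults)
```

None of these calls modifies df in place; each returns a new object, so the results have to be assigned to a variable to be kept.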

Conclusion
Calling one method directly on the result of another, as in the
select_dtypes() example above, is referred to as chaining. It is very
common in data science tasks because it removes the effort of defining a
variable for every step to be performed.
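As an illustration of chaining, the sketch below combines several of the functions from this article in a single expression. The data, the column names, and the particular sequence of steps are assumptions chosen for the example, not a prescribed pipeline.

```python
import pandas as pd

# Hypothetical raw data with untidy names and dates stored as strings.
df = pd.DataFrame(
    {
        "name": [" alice ", "BOB", "carol"],
        "joined": ["2021-01-15", "2022-06-30", "2023-03-01"],
        "score": [82, 45, 91],
    }
)

# Each step works on the result of the previous one; no intermediate variables.
result = (
    df.assign(
        name=lambda d: d["name"].str.strip().str.title(),   # clean the text
        joined=lambda d: pd.to_datetime(d["joined"]),        # parse the dates
    )
    .assign(join_year=lambda d: d["joined"].dt.year)         # derive a new column
    .query("score > 50")                                     # keep matching rows
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)

print(result)
```

Wrapping the chain in parentheses allows one method per line, which keeps long chains readable.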
If you are curious to learn about Pandas, check out IIIT-B &
upGrad’s PG Diploma in Data Science which is created for working
professionals and offers 10+ case studies & projects, practical hands-on
workshops, mentorship with industry experts, 1-on-1 with industry
mentors, 400+ hours of learning and job assistance with top firms.