Pandas 1

Pandas is a Python library for data manipulation and analysis, created by Wes McKinney in 2008. It provides tools for cleaning, exploring, and analyzing data sets, including functionalities for handling missing values, duplicates, and data correlation. Key components of Pandas include Series and DataFrames, which facilitate the organization and analysis of data in a tabular format.

Uploaded by

jaiswalarunima8

CAREERERA

PANDAS
WHAT IS PANDAS?
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and
manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis". The library was created by Wes McKinney in 2008.
WHY USE PANDAS?
• Pandas allows us to analyze big data and draw conclusions
based on statistical theories.
• Pandas can clean messy data sets, and make them readable
and relevant.
• Relevant data is very important in data science.
WHAT CAN PANDAS DO?
• Pandas gives you answers about the data, such as:
• Is there a correlation between two or more columns?
• What is the average value?
• What are the maximum and minimum values?
• Pandas can also delete rows that are not relevant or that contain
wrong values, such as empty or NULL values. This is called cleaning
the data.
HOW TO IMPORT PANDAS?

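• There are two ways to import pandas:

import pandas: This imports the entire pandas module. Its contents are
then accessed through the module name, or through a short alias such as pd.
from pandas import *: This imports all public classes, functions, and
variables from the pandas package into the current namespace. Here * means all.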
HOW TO USE import pandas?
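A minimal sketch of the first import style. The alias pd is a common convention, and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd  # pd is the conventional short alias

# Hypothetical sample data, used only for illustration
data = {"cars": ["BMW", "Volvo", "Ford"], "passings": [3, 7, 2]}

# Everything from pandas is reached through the pd prefix
df = pd.DataFrame(data)
print(df)
```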
HOW TO USE from pandas import*
WHAT IS SERIES?
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
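As a small illustration, a Series can be built from a Python list. The values below are hypothetical.

```python
import pandas as pd

values = [1, 7, 2]             # hypothetical sample values
my_series = pd.Series(values)  # one-dimensional, like a single table column
print(my_series)
```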
WHAT ARE LABELS?
• If nothing else is specified, the values are labeled with their index
number. First value has index 0, second value has index 1 etc.
• This label can be used to access a specified value.
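A short sketch of reading a value through its default label, assuming a Series built from hypothetical values without an explicit index.

```python
import pandas as pd

my_series = pd.Series([1, 7, 2])  # default labels are 0, 1, 2

first_value = my_series[0]        # look up the value labeled 0
print(first_value)
```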
HOW TO CREATE LABELS?
• With the index argument, you can name your own labels.
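For example, the index argument might be used like this. The labels "x", "y", and "z" are arbitrary names picked for illustration.

```python
import pandas as pd

# The index list supplies one label per value
my_series = pd.Series([1, 7, 2], index=["x", "y", "z"])
print(my_series)
```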
HOW TO ACCESS VALUES IN LABELS?

• When you have created labels, you can access an item by referring
to the label.
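A minimal sketch of label access, reusing the arbitrary labels "x", "y", and "z".

```python
import pandas as pd

my_series = pd.Series([1, 7, 2], index=["x", "y", "z"])

value_at_y = my_series["y"]  # access the item by its label
print(value_at_y)
```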
WHAT IS KEY/VALUE OBJECTS AS SERIES ?

• We can also use a key/value object, like a dictionary, when creating a
Series.
• Note: The keys of the dictionary become the labels.
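A short sketch of a Series built from a dictionary. The keys and calorie values are hypothetical.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}  # hypothetical data

my_series = pd.Series(calories)  # dictionary keys become the labels
print(my_series)
```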
HOW TO INCLUDE SPECIFIC ITEMS IN
SERIES?

• To select only some of the items in the dictionary, use the index
argument and specify only the items you want to include in the
Series.
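This could look like the following sketch, again with a hypothetical dictionary. Only the keys listed in index end up in the Series.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}  # hypothetical data

# Keep only the items whose keys are named in index
my_series = pd.Series(calories, index=["day1", "day2"])
print(my_series)
```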
WHAT IS DATAFRAME?
• Data sets in Pandas are usually multi-dimensional tables, called
DataFrames.
• Series is like a column, a DataFrame is the whole table.
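A minimal sketch of a DataFrame with two columns. The column names and values are hypothetical.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

df = pd.DataFrame(data)
print(df)
```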
WHAT IS LOCATE ROW?
• A DataFrame is like a table with rows and columns.
• Pandas uses the loc attribute to return one or more specified rows.
• NOTE: Selecting a single row, as in df.loc[0], returns a Pandas Series.
Passing a list of labels inside [], as in df.loc[[0, 1]], returns a Pandas
DataFrame.
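The difference between the two forms of loc can be sketched as follows, using a hypothetical DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row = df.loc[0]         # a single label returns a Series
rows = df.loc[[0, 1]]   # a list of labels returns a DataFrame

print(row)
print(rows)
```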
HOW TO NAME INDEXES?
• With the index argument, you can name your own indexes.
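For instance, a DataFrame with named indexes might be created like this. The names "day1" to "day3" and the values are hypothetical.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# One index name per row
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
```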
HOW TO LOCATE NAMED INDEXES?

• Use the named index in the loc attribute to return the specified
row(s).
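A short sketch of locating a row by its named index, with the same hypothetical data.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

day2_row = df.loc["day2"]  # the row whose index name is "day2"
print(day2_row)
```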
HOW TO USE READ CSV IN PANDAS?

• A simple way to store big data sets is to use CSV files
(comma-separated values files).
• CSV files contain plain text and are a well-known format that can be
read by almost any tool, including Pandas.
• Tip: use to_string() to print the entire DataFrame. By default, when a
DataFrame has more rows than the display.max_rows option (60 unless
changed), printing it shows only the first 5 rows and the last 5 rows.
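A minimal sketch of read_csv() and to_string(). To keep the example self-contained, the CSV text is held in a string and wrapped in io.StringIO; with a real file you would pass its path to read_csv() instead. The column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))  # a file path works the same way
print(df.to_string())                    # prints every row
```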
HOW TO LOAD FILES INTO A
DataFrame?
• If your data sets are stored in a file, Pandas can load them
into a DataFrame.
HOW TO ANALYZE DataFrame?
• By default, when a DataFrame has more rows than the display.max_rows
option (60 unless changed), printing it shows a shortened view: the first 5
rows and the last 5 rows.
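The shortened view can be compared with the full output in a small sketch. It assumes the default display options and a hypothetical 100-row DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # hypothetical 100-row data set

short_view = repr(df)        # what print(df) shows: a shortened view
full_view = df.to_string()   # every row, regardless of display options

print(short_view)
```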
HOW TO VIEW & ANALYZE THE DATA?

• One of the most used methods for getting a quick overview of the
DataFrame is the head() method.
• The head() method returns the headers and a specified number of
rows, starting from the top.
• Note: if the number of rows is not specified, the head() method will
return the top 5 rows.
HOW TO VIEW & ANALYZE THE HEAD OF
DATA?
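A short sketch of head(), using a hypothetical 20-row DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"value": range(20)})  # hypothetical data

top_ten = df.head(10)   # the first 10 rows
top_five = df.head()    # no argument: the first 5 rows

print(top_ten)
```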
HOW TO VIEW & ANALYZE THE TAIL OF DATA?

• There is also a tail() method for viewing the last rows of the
DataFrame.
• The tail() method returns the headers and a specified number of
rows, starting from the bottom.
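A matching sketch for tail(), with the same hypothetical 20-row DataFrame. Like head(), tail() returns 5 rows when no number is given.

```python
import pandas as pd

df = pd.DataFrame({"value": range(20)})  # hypothetical data

last_three = df.tail(3)  # the last 3 rows
last_five = df.tail()    # no argument: the last 5 rows

print(last_three)
```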
HOW TO CHECK INFO ABOUT THE
DATA?

• The DataFrame object has a method called info() that gives you
more information about the data set.
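A minimal sketch of info() on a hypothetical DataFrame with one empty cell. The printed summary lists each column with its non-null count and data type, which is a quick way to spot empty cells.

```python
import pandas as pd

# Hypothetical data; None marks an empty cell
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 195.1],
})

df.info()  # prints the summary; the method itself returns None
```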
HOW TO PERFORM DATA CLEANING?

• Data cleaning means fixing bad data in your data set.
• Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
WHAT IS PANDAS - CLEANING EMPTY
CELLS?
• Empty cells can potentially give you a wrong result when you analyze
data.
• Remove Rows
• One way to deal with empty cells is to remove rows that contain
empty cells.
• This is usually OK, since data sets can be very big, and removing a few
rows will not have a big impact on the result.
• Note: By default, the dropna() method returns a new DataFrame, and
will not change the original.
• If you want to change the original DataFrame, use the inplace = True
argument:
HOW TO PERFORM PANDAS - CLEANING EMPTY
CELLS?
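Both variants of dropna() can be sketched as follows. The DataFrame is hypothetical, with None marking empty cells.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, None, 30],
    "Calories": [409.1, None, 300.0, 195.1],
})

new_df = df.dropna()      # returns a new DataFrame; df is unchanged
rows_before = len(df)

df.dropna(inplace=True)   # changes df itself and returns None
print(df)
```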
HOW TO REPLACE EMPTY VALUES?

• Another way of dealing with empty cells is to insert a new value
instead.
• This way you do not have to delete entire rows just because of some
empty cells.
• The fillna() method allows us to replace empty cells with a value.
WHAT ARE THE STEPS TO REPLACE EMPTY
VALUES?
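A minimal sketch of fillna() on a hypothetical DataFrame. The replacement value 130 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, None, 30],
    "Calories": [409.1, None, 195.1],
})

# Every empty cell, in every column, is replaced with 130
filled = df.fillna(130)
print(filled)
```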
HOW TO REPLACE EMPTY VALUES ONLY FOR SPECIFIED COLUMNS?

• To only replace empty values for one column, specify the column
name for the DataFrame:
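One way to do this is to fill the selected column and assign the result back, as in this sketch. The data and the value 130 are hypothetical. Assigning back is used here rather than inplace=True on the column, since that form is unreliable in recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, None, 30],
    "Calories": [409.1, None, 195.1],
})

# Only the "Calories" column is filled; "Duration" keeps its empty cell
df["Calories"] = df["Calories"].fillna(130)
print(df)
```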
HOW TO FILL EMPTY VALUES USING MEAN,
MEDIAN, or MODE?

• A common way to replace empty cells is to calculate the mean,
median, or mode value of the column.
• Pandas uses the mean(), median(), and mode() methods to calculate
the respective values for a specified column:
• Mean = the average value (the sum of all values divided by number of
values).
• Median = the value in the middle, after you have sorted all values
ascending.
• Mode = the value that appears most frequently.
HOW TO FILL EMPTY VALUES USING MEAN ?
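A short sketch with hypothetical values. mean() skips empty cells when it calculates the average.

```python
import pandas as pd

df = pd.DataFrame({"Calories": [400.0, None, 300.0, 200.0]})

mean_value = df["Calories"].mean()  # (400 + 300 + 200) / 3
df["Calories"] = df["Calories"].fillna(mean_value)
print(df)
```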
HOW TO FILL EMPTY VALUES USING MEDIAN ?
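The same idea with median(), again on hypothetical values. The median is less affected by a single extreme value such as 1000.

```python
import pandas as pd

df = pd.DataFrame({"Calories": [100.0, None, 300.0, 1000.0]})

median_value = df["Calories"].median()  # middle of 100, 300, 1000
df["Calories"] = df["Calories"].fillna(median_value)
print(df)
```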
HOW TO FILL EMPTY VALUES USING MODE ?
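A sketch with mode(), on hypothetical values. mode() returns a Series because several values can tie for most frequent, so [0] picks the first one.

```python
import pandas as pd

df = pd.DataFrame({"Calories": [300.0, None, 300.0, 200.0]})

mode_value = df["Calories"].mode()[0]  # first of the most frequent values
df["Calories"] = df["Calories"].fillna(mode_value)
print(df)
```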
HOW TO PERFORM PANDAS-CLEANING DATA OF WRONG
FORMAT?

• Cells with data of wrong format can make it difficult, or even
impossible, to analyze data.
• To fix it, you have two options: remove the rows, or convert all
cells in the columns into the same format.
WHAT ARE THE STEPS TO CONVERT INTO A
CORRECT FORMAT?

# Convert To DATE
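A minimal sketch of converting a text column to dates with to_datetime(). The "Date" column and its values are hypothetical, and the sketch assumes every non-empty date is written as year/month/day. The errors="coerce" argument turns anything that cannot be parsed into NaT instead of raising an error.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", None, "2020/12/04"],
    "Duration": [60, 45, 45, 30],
})

# Convert the text column to datetime values; the empty cell becomes NaT
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")
print(df)
```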
HOW TO REMOVE UNWANTED ROWS?
• A date conversion can leave an empty date as a NaT (Not a Time)
value, which can be handled as a NULL value, and we can remove the
row by using the dropna() method.
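Removing rows with an empty date could look like this sketch. The data is hypothetical, and subset limits the check to the "Date" column.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020/12/01", None, "2020/12/03"], format="%Y/%m/%d"),
    "Duration": [60, 45, 30],
})

# Drop only the rows where "Date" is NaT
df = df.dropna(subset=["Date"])
print(df)
```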
HOW TO FIX WRONG DATA?

• "Wrong data" does not have to be "empty cells" or "wrong format", it
can just be wrong, like if someone registered "199" instead of "1.99".
• Sometimes you can spot wrong data by looking at the data set,
because you have an expectation of what it should be.
• If you take a look at our data set, you can see that in row 7, the
duration is 450, but for all the other rows the duration is between 30
and 60.
• It doesn't have to be wrong, but taking into consideration that this is
the data set of someone's workout sessions, we can conclude that this
person did not work out for 450 minutes.
WHAT IS REPLACING VALUES?

• One way to fix wrong values is to replace them with something else.

• For small data sets you might be able to replace the wrong data
one by one, but not for big data sets.
• To replace wrong data for larger data sets you can create some
rules, e.g. set some boundaries for legal values, and replace any
values that are outside of the boundaries.
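Replacing a single value can be sketched with loc. The data below is hypothetical and only mimics a data set where row 7 holds 450; the replacement value 45 is an assumption about what was meant.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60]})

# Set the "Duration" cell in row 7 to a plausible value
df.loc[7, "Duration"] = 45
print(df)
```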
HOW TO PERFORM REPLACE VALUES?

• Loop through all values in the "Duration" column.
• If the value is higher than 120, set it to 120:
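The two steps can be sketched as a loop over the row labels. The data is hypothetical, and 120 is the boundary named in the text.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Visit every row and cap values above the boundary
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

print(df)
```

For larger data sets, df["Duration"].clip(upper=120) gives the same result without an explicit loop.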
HOW TO PERFORM REMOVING ROWS?
• Another way of handling wrong data is to remove the rows that contain
wrong data.
• This way you do not have to find out what to replace them with, and
there is a good chance you do not need them to do your analyses.
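One way to remove such rows is to keep only the rows that pass a condition, as in this sketch. The data and the boundary of 120 are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Keep only the rows whose duration is within the allowed boundary
df = df[df["Duration"] <= 120]
print(df)
```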
WHAT ARE DUPLICATES?

• Duplicate rows are rows that have been registered more than one
time.
• By taking a look at our test data set, we can assume that rows 11 and
12 are duplicates.
• To discover duplicates, we can use the duplicated() method.
• The duplicated() method returns a Boolean value for each row:
HOW TO DISCOVER DUPLICATES?
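A minimal sketch of duplicated() on hypothetical data. A row is marked True when an identical row appeared earlier, so the first occurrence stays False.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 117, 110],
})

is_duplicate = df.duplicated()  # one Boolean per row
print(is_duplicate)
```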
HOW TO REMOVE DUPLICATES?
• To remove duplicates, use the drop_duplicates() method.
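A short sketch of drop_duplicates() with the same hypothetical data. Like dropna(), it returns a new DataFrame unless inplace=True is passed.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Pulse": [110, 117, 110],
})

# The first occurrence of each row is kept
unique_rows = df.drop_duplicates()
print(unique_rows)
```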
HOW TO CHECK PANDAS - DATA
CORRELATION?
• A great aspect of the Pandas module is the corr() method.
• The corr() method calculates the relationship between each pair of
columns in your data set.
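A minimal sketch of corr() on a small, all-numeric DataFrame. The values are hypothetical and were chosen so that "Calories" rises in exact proportion to "Duration", which gives a correlation of 1 between those two columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Maxpulse": [130, 120, 140, 125],
    "Calories": [200, 300, 400, 600],
})

# The result is a table with one row and one column per input column
correlations = df.corr()
print(correlations)
```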
WHAT IS PANDAS - DATA
CORRELATION?
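• Older versions of Pandas ignore non-numeric columns in corr(). Since
pandas 2.0, corr() raises an error on non-numeric columns unless
numeric_only=True is passed.
• Result explained: the result of the corr() method is a table of
numbers that represent how strong the relationship is between two
columns.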
• The number varies from -1 to 1.
• 1 means that there is a 1 to 1 relationship (a perfect correlation), and
for this data set, each time a value went up in the first column, the
other one went up as well.
• 0.9 is also a good relationship, and if you increase one value, the
other will probably increase as well.
WHAT ARE THE TYPES OF PANDAS - DATA CORRELATION?

• -0.9 would be just as good a relationship as 0.9, but if you increase one
value, the other will probably go down.
• 0.2 means NOT a good relationship, meaning that if one value goes
up, it does not mean that the other will.
• What is a good correlation? It depends on the use, but I think it is safe
to say you have to have at least 0.6 (or -0.6) to call it a good
correlation.
• Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000,
which makes sense, each column always has a perfect relationship
with itself.
• Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very
good correlation, and we can predict that the longer you work out,
the more calories you burn, and the other way around: if you burned
a lot of calories, you probably had a long workout.
• Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a
very bad correlation, meaning that we cannot predict the max pulse
by just looking at the duration of the workout, and vice versa.
THANK YOU !!!
