Pandas 1
PANDAS
WHAT IS PANDAS?
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and
manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis". The library was created by Wes McKinney in 2008.
WHY USE PANDAS?
• Pandas allows us to analyze large data sets and draw conclusions
based on statistical theories.
• Pandas can clean messy data sets and make them readable
and relevant.
• Relevant data is very important in data science.
WHAT CAN PANDAS DO?
• Pandas gives you answers about the data, such as:
• Is there a correlation between two or more columns?
• What is the average value?
• What is the max value? The min value?
• Pandas is also able to delete rows that are not relevant or that
contain wrong values, like empty or NULL values. This is called
cleaning the data.
HOW TO IMPORT PANDAS?
• When you have created labels, you can access an item by referring
to the label.
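A minimal sketch of importing Pandas and accessing a Series item by its label, assuming the conventional alias pd. The values and the labels "x", "y", "z" are made up for illustration.

```python
import pandas as pd  # "pd" is the conventional alias, not a requirement

# Illustrative values and labels (not taken from the slides)
calories = pd.Series([420, 380, 390], index=["x", "y", "z"])

# Access an item by referring to its label
print(calories["y"])
```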
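WHAT ARE KEY/VALUE OBJECTS AS SERIES?
• A key/value object, like a dictionary, can be used to create a
Series; the keys of the dictionary become the labels.
• To select only some of the items in the dictionary, use the index
argument and specify only the items you want to include in the
Series.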
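The index argument could be used roughly like this. The dictionary and its keys "day1" to "day3" are illustrative assumptions.

```python
import pandas as pd

# Illustrative key/value data: the keys become the labels
calories = {"day1": 420, "day2": 380, "day3": 390}

# Without the index argument, every item is included
all_days = pd.Series(calories)

# With the index argument, only the listed keys are included
some_days = pd.Series(calories, index=["day1", "day2"])
print(some_days)
```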
WHAT IS DATAFRAME?
• Data sets in Pandas are usually two-dimensional tables, called
DataFrames.
• A Series is like a column; a DataFrame is the whole table.
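As a small sketch, a DataFrame can be built from a dictionary of columns. The column names and values here are made up for illustration.

```python
import pandas as pd

# Each key is a column name, each list holds that column's values
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
print(df)

# Selecting one column of the table gives back a Series
column = df["calories"]
print(type(column))
```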
HOW TO LOCATE A ROW?
• A DataFrame is like a table with rows and columns.
• Pandas uses the loc attribute to return one or more specified rows.
• NOTE: For a DataFrame df, a single label such as df.loc[0] returns a
Pandas Series. With a list of labels in [], such as df.loc[[0, 1]], the
result is a Pandas DataFrame.
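A short sketch of both forms of loc, using an illustrative DataFrame with the default 0, 1, 2 row labels.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row = df.loc[0]         # single label: the result is a Series
rows = df.loc[[0, 1]]   # list of labels: the result is a DataFrame

print(row)
print(rows)
```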
HOW TO NAME INDEXES?
• With the index argument, you can name your own indexes.
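One way this might look, assuming made-up row names "day1" to "day3":

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}

# The index argument replaces the default 0, 1, 2 row labels
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
```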
HOW TO LOCATE NAMED INDEXES?
• Use the named index in the loc attribute to return the specified
row(s).
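A sketch of looking up rows by a named index. The index names are the same illustrative assumptions as before.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

one_day = df.loc["day2"]              # one named row, returned as a Series
two_days = df.loc[["day1", "day3"]]   # several named rows, returned as a DataFrame

print(one_day)
print(two_days)
```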
HOW TO USE READ CSV IN PANDAS?
• A simple way to store big data sets is to use CSV files
(comma-separated values files).
• CSV files contain plain text and are a well-known format that can be
read by almost any tool, including Pandas.
• Tip: use to_string() to print the entire DataFrame. By default, when
you print a DataFrame with more rows than the display.max_rows
option (60 by default), you will only get the first 5 rows and the last
5 rows.
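A minimal sketch of read_csv() and to_string(). No CSV file ships with these slides, so the example builds 100 rows of made-up CSV text in memory and reads it through io.StringIO; with a real file you would pass its path to read_csv() instead.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (made-up values, 100 data rows)
rows = [f"{30 + i},{100 + i},{200 + 5 * i}" for i in range(100)]
csv_text = "Duration,Pulse,Calories\n" + "\n".join(rows)

df = pd.read_csv(io.StringIO(csv_text))

print(df)              # a long DataFrame is shortened to its first and last rows
print(df.to_string())  # prints every row
```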
HOW TO LOAD FILES INTO A DataFrame?
• If your data sets are stored in a file, Pandas can load them
into a DataFrame.
HOW TO ANALYZE DataFrame?
• By default, when you print a large DataFrame (more rows than the
display.max_rows option, 60 by default), you will only get the first 5
rows and the last 5 rows.
HOW TO VIEW & ANALYZE THE DATA?
• One of the most used methods for getting a quick overview of the
DataFrame is the head() method.
• The head() method returns the headers and a specified number of
rows, starting from the top.
• Note: if the number of rows is not specified, the head() method will
return the top 5 rows.
HOW TO VIEW & ANALYZE THE HEAD OF DATA?
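A short sketch of head() with and without a row count, on an illustrative 20-row DataFrame:

```python
import pandas as pd

# Made-up data: 20 rows
df = pd.DataFrame({"Duration": range(20), "Calories": range(100, 120)})

first_ten = df.head(10)  # headers plus the first 10 rows
first_five = df.head()   # no number given: the first 5 rows

print(first_ten)
print(first_five)
```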
HOW TO VIEW & ANALYZE THE TAIL OF DATA?
• There is also a tail() method for viewing the last rows of the
DataFrame.
• The tail() method returns the headers and a specified number of
rows, starting from the bottom.
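The same idea for tail(), sketched on an illustrative 20-row DataFrame. Like head(), tail() returns 5 rows when no number is given.

```python
import pandas as pd

# Made-up data: 20 rows
df = pd.DataFrame({"Duration": range(20), "Calories": range(100, 120)})

last_three = df.tail(3)  # headers plus the last 3 rows
last_five = df.tail()    # no number given: the last 5 rows

print(last_three)
print(last_five)
```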
HOW TO CHECK INFO ABOUT THE DATA?
• The DataFrame object has a method called info() that gives you
more information about the data set, such as the number of rows,
the column names, and the non-null count and data type of each
column.
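A sketch of info() on a made-up DataFrame with one empty value. info() prints its report instead of returning it, so the example also captures the text through the buf argument.

```python
import io
import pandas as pd

# Made-up data with one empty value in "Calories"
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, None, 195.0]})

df.info()  # prints the summary and returns None

# The same report can be captured as text
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
```

The non-null count per column is a quick way to spot columns with empty values: here "Calories" reports fewer non-null values than there are rows.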
HOW TO PERFORM DATA CLEANING?
• To replace empty values in only one column, specify the column
name of the DataFrame when calling fillna().
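A minimal sketch of filling empty values in a single column. The data and the replacement value 130 are arbitrary assumptions.

```python
import pandas as pd

# Made-up data with one empty value in "Calories"
df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, None, 195.0]})

# Replace empty values in the "Calories" column only
df["Calories"] = df["Calories"].fillna(130)
print(df)
```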
HOW TO FILL EMPTY VALUES USING MEAN, MEDIAN, or MODE?
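One possible sketch: compute the mean, median, or mode of a column and pass it to fillna(). The values are made up so that the three statistics differ.

```python
import pandas as pd

# Made-up column with one empty value
calories = pd.Series([100.0, None, 100.0, 300.0, 500.0])

mean_filled = calories.fillna(calories.mean())      # average of the non-empty values
median_filled = calories.fillna(calories.median())  # middle value after sorting
mode_filled = calories.fillna(calories.mode()[0])   # mode() returns a Series; take its first entry

print(mean_filled[1], median_filled[1], mode_filled[1])
```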
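• Cells in the wrong format can be converted, e.g. a text column can
be converted to dates with the to_datetime() function.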
HOW TO REMOVE UNWANTED ROWS?
• Converting a column to dates turns an empty cell into a NaT (Not a
Time) value, which can be handled as a NULL value, and we can
remove the row by using the dropna() method.
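A sketch of the convert-then-drop steps. The "Date" column, its values, and the date format are assumptions made for illustration.

```python
import pandas as pd

# Made-up data: the second row has no date
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Date": ["2020/12/21", None, "2020/12/23"],
})

# Convert the text column to dates; the empty cell becomes NaT
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")

# Remove the rows where "Date" is NaT
df = df.dropna(subset=["Date"])
print(df)
```

The subset argument limits the check to the "Date" column, so rows with empty values in other columns are kept.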
HOW TO FIX WRONG DATA?
• One way to fix wrong values is to replace them with something else.
• For small data sets you might be able to replace the wrong data
one by one, but not for big data sets.
• To replace wrong data for larger data sets you can create some
rules, e.g. set some boundaries for legal values, and replace any
values that are outside of the boundaries.
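Both approaches could be sketched as follows. The data, the bad value 450, and the upper boundary of 120 are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data: 450 in row 1 is assumed to be a typo
df = pd.DataFrame({"Duration": [60, 450, 45, 30]})

# Small data set: replace the wrong value directly
one_by_one = df.copy()
one_by_one.loc[1, "Duration"] = 45

# Larger data set: apply a rule, here "no value above 120"
by_rule = df.copy()
by_rule.loc[by_rule["Duration"] > 120, "Duration"] = 120

print(one_by_one)
print(by_rule)
```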
HOW TO PERFORM REPLACE VALUES?
• Duplicate rows are rows that have been registered more than one
time.
• By taking a look at our test data set, we can assume that rows 11 and
12 are duplicates.
• To discover duplicates, we can use the duplicated() method.
• The duplicated() method returns a Boolean value for each row: True
for a row that repeats an earlier row, otherwise False.
HOW TO DISCOVER DUPLICATES?
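A minimal sketch of duplicated() on made-up data in which rows 2 and 3 repeat row 0:

```python
import pandas as pd

# Made-up data: rows 2 and 3 are copies of row 0
df = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 109, 110, 110],
})

# True marks a row that repeats an earlier row
flags = df.duplicated()
print(flags)
```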
HOW TO REMOVE DUPLICATES?
• To remove duplicates, use the drop_duplicates() method.
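A sketch of drop_duplicates() on the same kind of made-up data. By default it returns a new DataFrame and keeps the first occurrence of each row.

```python
import pandas as pd

# Made-up data: rows 2 and 3 are copies of row 0
df = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 109, 110, 110],
})

deduplicated = df.drop_duplicates()
print(deduplicated)
```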
HOW TO CHECK PANDAS - DATA CORRELATION?
• A great aspect of the Pandas module is the corr() method.
• The corr() method calculates the relationship between each pair of
columns in your data set.
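A minimal sketch of corr() on a small made-up data set. The values are not the workout data the slides refer to, so the numbers will differ from the ones quoted there.

```python
import pandas as pd

# Made-up numeric data
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30, 45],
    "Calories": [409, 479, 340, 195, 282],
})

# One row and one column per numeric column of df
correlations = df.corr()
print(correlations)
```

The result is symmetric: the value for "Duration" against "Calories" is the same as for "Calories" against "Duration".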
WHAT IS PANDAS - DATA CORRELATION?
• The corr() method works on numeric columns only. Older Pandas
versions silently ignore "not numeric" columns; since Pandas 2.0 you
need to pass numeric_only=True to skip them.
• Result explained: the result of the corr() method is a table with a lot
of numbers that represent how strong the relationship is between two
columns.
• The number varies from -1 to 1.
• 1 means that there is a 1 to 1 relationship (a perfect correlation): for
this data set, each time a value went up in the first column, the
other one went up as well.
• 0.9 is also a good relationship: if you increase one value, the
other will probably increase as well.
WHAT ARE THE TYPES OF PANDAS - DATA CORRELATION?
• -0.9 would be just as good a relationship as 0.9, but if you increase one
value, the other will probably go down.
• 0.2 means NOT a good relationship: if one value goes
up, it does not mean that the other will.
• What is a good correlation? It depends on the use, but I think it is safe
to say you need at least 0.6 (or -0.6) to call it a good
correlation.
• Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000,
which makes sense: each column always has a perfect relationship
with itself.
• Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very
good correlation. We can predict that the longer you work out,
the more calories you burn, and the other way around: if you burned
a lot of calories, you probably had a long workout.
• Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a
very bad correlation, meaning that we cannot predict the max pulse
by just looking at the duration of the workout, and vice versa.
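The three cases can also be reproduced on a tiny made-up data set. The columns "x", "up", "down", and "unrelated" are invented for illustration and are not the workout data discussed above.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "up": [2, 4, 6, 8, 10],       # rises in step with x
    "down": [10, 8, 6, 4, 2],     # falls in step with x
    "unrelated": [2, 1, 3, 1, 2], # no linear relationship with x
})

correlations = df.corr()

# Correlation of every column with "x":
# close to 1 for "up", close to -1 for "down", close to 0 for "unrelated"
print(correlations["x"])
```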
THANK YOU !!!