Data Science - Sec3

Data Science
Section 3: Pandas
Pandas

• Pandas is a Python library used for working with data sets.


• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
• Pandas allows us to analyze big data and draw conclusions based on statistical theories.
• Pandas can clean messy data sets and make them readable and relevant.
• Relevant data is very important in data science.
Pandas

• Pandas is a data-processing tool that helps with data analysis.


• It provides functions and methods to efficiently manipulate large datasets.
• Data structures in Pandas:
▪ Series (one-dimensional array)
▪ DataFrame (two-dimensional array)
Pandas

• Install pandas:
• Pandas is usually imported under the pd alias.
▪ Alias: in Python, an alias is an alternate name for referring to the same thing.
• The two most common terms used in Pandas:
▪ Series
▪ DataFrame
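
A minimal sketch of the install and import steps, assuming pip is the package manager in use (the original slide's command was not preserved):

```python
# Install from a terminal first (assumes pip is available):
#   pip install pandas

import pandas as pd  # "pd" is the conventional alias

# Printing the version is a quick way to confirm the import worked.
print(pd.__version__)
```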
Series
• It is a one-dimensional array holding data of any type.
• A Pandas Series is like a column in a table.
• Labels in series:
▪ If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.
▪ With the index argument, you can name your own labels.
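
The two labeling options can be sketched as follows. The values and the labels "x", "y", "z" are illustrative, not taken from the original slides:

```python
import pandas as pd

values = [1, 7, 2]  # illustrative values

# Default index: labels are 0, 1, 2
default_series = pd.Series(values)

# Custom index: labels supplied through the index argument
custom_series = pd.Series(values, index=["x", "y", "z"])

print(default_series)
print(custom_series)
```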

(Figures: a Series with the default index and a Series with a custom index)
Access Data in Series
• Pandas Series support both label-based and position-based indexing.
• Example 1: Access elements by label.
• Example 2: Access elements by position.
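
A possible version of both examples, using a made-up Series. Here iloc is used for position-based access so that the intent stays explicit when the labels are not integers:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])  # illustrative data

by_label = s["b"]        # label-based access; s.loc["b"] is equivalent
by_position = s.iloc[1]  # position-based access (second element)

print(by_label, by_position)
```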
Slicing in Series
• Example 1: Slicing by labels.
▪ [start_label : end_label]
▪ Both the start and end labels are included.
• Example 2: Slicing by positions.
▪ [start_index : end_index]
▪ The end index is not included.
• We can check the number of elements in a Series using the size attribute and get its shape using the shape attribute.
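
A short sketch of both slicing styles plus size and shape, on an illustrative Series. Position slicing is written with iloc here to keep it unambiguous:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])  # illustrative data

label_slice = s["a":"c"]       # labels a, b, c (end label included)
position_slice = s.iloc[0:2]   # positions 0 and 1 (end index excluded)

print(label_slice)
print(position_slice)
print(s.size)   # number of elements
print(s.shape)  # tuple with one entry for a one-dimensional Series
```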
DataFrame
• A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.
• Create a simple Pandas DataFrame using a dictionary:
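
One way this could look, with made-up column names and values. Each dictionary key becomes a column:

```python
import pandas as pd

# Illustrative data: keys become column names, lists become column values
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

df = pd.DataFrame(data)
print(df)
```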
DataFrame
• Create a simple Pandas DataFrame using nested lists:
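
A sketch with hypothetical names and ages. Each inner list is one row, and the column names are passed separately:

```python
import pandas as pd

# Illustrative data: each inner list is one row
rows = [
    ["Alice", 25],
    ["Bob", 30],
    ["Carol", 35],
]

df = pd.DataFrame(rows, columns=["name", "age"])
print(df)
```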
DataFrame
• Pandas uses the loc attribute to return one or more specified rows.
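
A minimal sketch on an illustrative DataFrame. A single label returns a Series, while a label slice returns a DataFrame and includes both ends:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

first_row = df.loc[0]     # one row, returned as a Series
first_two = df.loc[0:1]   # rows labeled 0 and 1, returned as a DataFrame

print(first_row)
print(first_two)
```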
DataFrame
• Pandas can also use the loc attribute to return specified rows without slicing, for example by passing a list of row labels.
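
Assuming "without slicing" refers to passing a list of labels (the original example was not preserved), this could be sketched as:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# A list of labels selects exactly those rows; they need not be adjacent
selected = df.loc[[0, 2]]
print(selected)
```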
CSV File
• A simple way to store big data sets is to use CSV files (comma-separated values files).
• Create a CSV file:
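
One possible way to create a CSV file is to write a DataFrame out with to_csv. The data and the temporary file location below are assumptions made for illustration:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409.1, 282.4, 195.1]})

# Write to a temporary folder so the sketch does not overwrite real files
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")

# index=False leaves the row labels out of the file
df.to_csv(csv_path, index=False)
print(csv_path)
```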
CSV File
• Load the CSV into a DataFrame:
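
A sketch of loading with read_csv. To stay self-contained it first writes a small illustrative file to a temporary folder; in practice the path would point at an existing CSV:

```python
import os
import tempfile

import pandas as pd

# Create a small illustrative CSV file to load
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(csv_path, "w") as f:
    f.write("Duration,Calories\n60,409.1\n45,282.4\n")

# The first line of the file is used as the column headers
df = pd.read_csv(csv_path)
print(df)
```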

Excel File
• Create an Excel file and load it into a DataFrame:
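
A hedged sketch of the round trip with to_excel and read_excel. It assumes the optional openpyxl package is installed for .xlsx files and skips the step otherwise; the data and file location are illustrative:

```python
import importlib.util
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 30], "Calories": [409, 282, 195]})
excel_path = os.path.join(tempfile.mkdtemp(), "data.xlsx")

loaded = None
if importlib.util.find_spec("openpyxl") is not None:
    # Create the Excel file, then load it back into a DataFrame
    df.to_excel(excel_path, index=False, engine="openpyxl")
    loaded = pd.read_excel(excel_path, engine="openpyxl")
    print(loaded)
else:
    print("openpyxl is not installed; skipping the Excel round trip")
```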

Exploratory analysis using
pandas
• Load the data.csv file into a DataFrame, then print it:
• If you have a large DataFrame with many rows, Pandas will only print the first 5 rows and the last 5 rows.
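
The truncated printout can be reproduced with a synthetic DataFrame standing in for data.csv (which is not included here). This assumes the default pandas display options:

```python
import pandas as pd

# Synthetic stand-in for a large data set
df = pd.DataFrame({"Duration": range(100), "Calories": range(100, 200)})

# repr(df) is the text that print(df) shows; long frames are shortened
text = repr(df)
print(text)

# to_string() renders every row when the full table is needed
full_text = df.to_string()
```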

Viewing the Data

• The head() method returns the headers and a specified number of rows, starting from the top.
• Note: if the number of rows is not specified, the head() method will return the top 5 rows.
• The tail() method returns the headers and a specified number of rows, starting from the bottom.
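
A brief sketch of head() and tail() on an illustrative 20-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Duration": range(20)})  # illustrative data

top_default = df.head()   # first 5 rows when no number is given
top_ten = df.head(10)     # first 10 rows
bottom_three = df.tail(3) # last 3 rows

print(top_default)
print(bottom_three)
```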
Viewing the Data
• The DataFrame object has a method called info() that gives you more information about the data set.
• The info() method also tells us how many non-null values are present in each column. In our data set there are 164 non-null values out of 169 in the "Calories" column.
• This means that 5 rows have no value at all in the "Calories" column.
• Empty values, or null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data.
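
A small sketch of checking for and removing empty values. The four-row DataFrame is a made-up stand-in for the data set described above, and dropna() is one possible way to remove rows with empty values:

```python
import pandas as pd

# Illustrative data with one missing "Calories" value
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45],
    "Calories": [409.1, None, 340.0, 282.4],
})

df.info()  # prints column types and non-null counts

non_null = int(df["Calories"].count())      # number of non-null values
missing = int(df["Calories"].isna().sum())  # number of empty values

cleaned = df.dropna()  # new DataFrame without rows that contain empty values
print(non_null, missing, len(cleaned))
```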
Viewing the Data
• Examples 1 and 2: Conditional expressions enable us to extract specific subsets of data based on a defined condition.
• The output of the conditional expression
(>, but also ==, !=, <, <=,… would
work) is actually a pandas Series of
boolean values (either True or False)
with the same number of rows as the
original DataFrame. Such a Series of
boolean values can be used to filter the
DataFrame by putting it in between the
selection brackets []. Only rows for
which the value is True will be selected.
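
The filtering mechanism described above could be sketched like this, with illustrative data and an arbitrary threshold of 50:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 80, 30],
    "Calories": [409.1, 282.4, 520.0, 195.1],
})

# The condition produces a boolean Series with one entry per row
mask = df["Duration"] > 50

# Placing the mask inside [] keeps only the rows where it is True
long_sessions = df[mask]

print(mask)
print(long_sessions)
```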
Viewing the Data
• Example: Select specific columns.
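
A possible version of this example, with illustrative column names. A single name returns a Series, and a list of names returns a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 80],
    "Pulse": [110, 117, 103],
    "Calories": [409.1, 282.4, 520.0],
})

one_column = df["Duration"]                 # a single column as a Series
two_columns = df[["Duration", "Calories"]]  # a list of names gives a DataFrame

print(one_column)
print(two_columns)
```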
isin() method
• The isin() method checks if the DataFrame contains the specified value(s).
• Example 1: Return rows that have values 80 or 90 in the “Duration” column.
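
Example 1 might look like the following on made-up data. isin() returns a boolean Series that is then used to filter the rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 80, 45, 90, 80],
    "Calories": [409.1, 520.0, 282.4, 610.3, 498.7],
})

# True where "Duration" is 80 or 90
matches = df["Duration"].isin([80, 90])

result = df[matches]
print(result)
```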
Practical section
Steps:
• Download the data set from this link:
• https://tinyurl.com/Sec3DS
• Import pandas.
• Load the “CardioGoodFitness.csv” file.