Data Science - Sec3


Data Science, Section 3: Pandas
Pandas

• Pandas is a Python library for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.
• Pandas allows us to analyze large data sets and draw conclusions based on
statistical theory.
• Pandas can clean messy data sets and make them readable and relevant.
• Relevant data is very important in data science.
Pandas

• Pandas is a data processing tool that supports data analysis.
• It provides functions and methods to manipulate large datasets efficiently.
• Data structures in Pandas:
▪ Series (one-dimensional labeled array)
▪ DataFrame (two-dimensional labeled table)
Pandas

• Install pandas, for example with pip: pip install pandas
• Pandas is usually imported under the pd alias.
▪ Alias: in Python, an alias is an alternate name for referring to the same thing.
• The two most common terms used in Pandas:
▪ Series
▪ DataFrame
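A minimal sketch of the import convention described above. It assumes pandas is already installed in the active environment; printing the version is only an illustrative way to confirm that the alias works.

```python
# pandas must be installed first (for example with: pip install pandas)
import pandas as pd  # "pd" is the conventional alias for the pandas module

# The alias refers to the same module object, so every pandas feature
# is reachable through the shorter name.
print(pd.__name__)     # the module's real name
print(pd.__version__)  # the installed version (varies by environment)
```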
Series
• It is a one-dimensional array holding data of any type.
• A Pandas Series is like a column in a table.
• Labels in series:
▪ If nothing else is specified, the values are labeled with their index number. First
value has index 0, second value has index 1 etc.
▪ With the index argument, you can name your own labels.
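The two labeling options can be sketched as follows. The values and the labels "x", "y", "z" are made up for illustration and do not come from the original slides.

```python
import pandas as pd

values = [1, 7, 2]  # illustrative data

# Default index: the values are labeled 0, 1, 2
default_series = pd.Series(values)
print(default_series)

# Custom index: the index argument supplies our own labels
custom_series = pd.Series(values, index=["x", "y", "z"])
print(custom_series)
```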

[Figure: a Series with a custom index and a Series with the default index]
Access Data in Series
• Pandas Series support both label-based and position-based indexing.
• Example 1: access elements by label.
• Example 2: access elements by position.
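A possible version of the two examples, using made-up data. The sketch uses the explicit .loc (label) and .iloc (position) accessors so that the two kinds of access cannot be confused; plain brackets such as s["b"] also work for labels.

```python
import pandas as pd

# Illustrative Series with string labels
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Example 1: access an element by its label
by_label = s.loc["b"]

# Example 2: access an element by its position (0-based)
by_position = s.iloc[0]

print(by_label, by_position)
```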
Slicing in Series
• Example 1: slicing by labels.
▪ [start_label : end_label]
▪ Both the start label and the end label are included.
• Example 2: slicing by positions.
▪ [start_index : end_index]
▪ The end index is not included.
• We can check the number of elements in a Series using the size attribute and
get the shape of the Series using the shape attribute.
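The slicing rules and the size and shape attributes can be sketched like this. The data is made up, and .loc and .iloc are used here to make the label-based and position-based cases explicit.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# Example 1: slicing by labels includes BOTH endpoints ("b", "c", "d")
label_slice = s.loc["b":"d"]

# Example 2: slicing by positions excludes the end index (positions 1 and 2)
position_slice = s.iloc[1:3]

# size and shape are attributes, so they are read without parentheses
n_elements = s.size
dimensions = s.shape

print(label_slice)
print(position_slice)
print(n_elements, dimensions)
```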
DataFrame
• A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional
array, or a table with rows and columns.
• Create a simple Pandas DataFrame using a dictionary:
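A minimal sketch of building a DataFrame from a dictionary. The column names and values are illustrative: each key becomes a column name and each list becomes that column's values.

```python
import pandas as pd

# Keys become column names; lists become column values (made-up data)
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

df = pd.DataFrame(data)
print(df)
```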
DataFrame
• Create a simple Pandas DataFrame using nested lists:
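One way this could look, with made-up names and ages. Each inner list is one row, and the columns argument names the columns.

```python
import pandas as pd

# Each inner list is one row of the table (illustrative data)
rows = [
    ["Alice", 25],
    ["Bob", 30],
    ["Carol", 35],
]

df = pd.DataFrame(rows, columns=["name", "age"])
print(df)
```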
DataFrame
• Pandas uses the loc attribute to return one or more rows.
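A short sketch of loc with made-up data, assuming the default integer row labels. A single label returns one row as a Series, while a label slice returns several rows as a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

# One row: the result is a Series whose index is the column names
first_row = df.loc[0]

# Several rows: a label slice with loc includes both endpoints
first_two_rows = df.loc[0:1]

print(first_row)
print(first_two_rows)
```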
DataFrame
• Pandas can also use the loc
attribute to return specified rows
without slicing.
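One reading of "without slicing" is passing a list of row labels to loc. The sketch below assumes that reading; the labels "day1" to "day3" and the values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],  # hypothetical row labels
)

# A list of labels picks exactly those rows, in the order given
selected = df.loc[["day1", "day3"]]
print(selected)
```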
CSV File
• A simple way to store large data sets is to use CSV files
(comma-separated values files).
• Create a CSV file:
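A CSV file can be written by hand or from a DataFrame. The sketch below takes the second route with to_csv; the values are made up, and the file is written to a temporary folder so the example does not overwrite anything.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, 282.4, 195.0],
})

# Temporary location chosen for this example only
csv_path = Path(tempfile.mkdtemp()) / "data.csv"

# index=False keeps the row labels out of the file
df.to_csv(csv_path, index=False)

print(csv_path.read_text())
```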
CSV File
• Load the CSV into a DataFrame:
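Loading can be sketched with read_csv. To keep the example self-contained, it first writes a tiny CSV file with made-up values to a temporary folder; in practice the path would point to an existing file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small CSV file to load (illustrative content)
csv_path = Path(tempfile.mkdtemp()) / "data.csv"
csv_path.write_text("Duration,Calories\n60,409.1\n45,282.4\n")

# The first line of the file is used as the column headers
df = pd.read_csv(csv_path)
print(df)
```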

Excel File
• Create an Excel file and load it into a DataFrame:
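A hedged sketch of the Excel round trip with to_excel and read_excel. Pandas needs an optional engine for .xlsx files; this example assumes openpyxl and skips the round trip if it is not installed. The data and the temporary path are made up.

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45],
    "Calories": [409.1, 282.4],
})

xlsx_path = Path(tempfile.mkdtemp()) / "data.xlsx"

# Assumption: openpyxl is the engine used for .xlsx files
if importlib.util.find_spec("openpyxl") is not None:
    df.to_excel(xlsx_path, index=False)  # create the Excel file
    loaded = pd.read_excel(xlsx_path)    # load it back into a DataFrame
    print(loaded)
else:
    loaded = None
    print("openpyxl is not installed; skipping the Excel round trip")
```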

Exploratory analysis using pandas
• Load the data.csv file into a DataFrame, then print it:
• If a DataFrame has many rows (more than 60 with the default display
settings), Pandas prints only the first 5 rows and the last 5 rows.
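The truncated printout can be reproduced without the original file. The sketch below builds a synthetic stand-in for data.csv (169 rows of made-up values) and sets the two display options to their default values explicitly, so the behavior does not depend on local settings.

```python
import pandas as pd

# Synthetic stand-in for data.csv: 169 rows of made-up values
df = pd.DataFrame({
    "Duration": [30 + (i % 4) * 15 for i in range(169)],
    "Calories": [200.0 + i for i in range(169)],
})

# 60 and 10 are the pandas defaults: tables longer than 60 rows are
# shortened to 10 rows (the first 5 and the last 5).
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    text = repr(df)

print(text)
```

The last line of the printout reports the full size of the table, so no information about its dimensions is lost.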

Viewing the Data

• The head() method returns the headers and a specified number of rows,
starting from the top.
• Note: if the number of rows is not specified, the head() method will return
the top 5 rows.
• The tail() method returns the headers and a specified number of rows,
starting from the bottom.
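A brief sketch of head() and tail() on a made-up DataFrame with 20 rows.

```python
import pandas as pd

# Illustrative data: 20 rows
df = pd.DataFrame({
    "Duration": [30 + i for i in range(20)],
    "Calories": [200.0 + 10 * i for i in range(20)],
})

top_default = df.head()   # no argument: the first 5 rows
top_three = df.head(3)    # the first 3 rows
bottom_two = df.tail(2)   # the last 2 rows

print(top_three)
print(bottom_two)
```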
Viewing the Data
• The DataFrame object has a method called info() that gives you more
information about the data set.
• The info() method also tells us how many non-null values are present in each
column. In our data set, there are 164 non-null values out of 169 in the
"Calories" column.
▪ This means that 5 rows have no value at all in the "Calories" column, for
whatever reason.
▪ Empty values, or null values, can distort an analysis, and you should consider
removing rows with empty values. This is a step towards what is called
cleaning data.
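The non-null count can be reproduced on synthetic data. The sketch below is not the original data set: it builds 169 made-up rows and blanks out 5 "Calories" values at arbitrary positions. Removing the incomplete rows with dropna() is one possible cleaning step, shown here as an assumption about how the removal might be done.

```python
import io

import pandas as pd

# Synthetic data: 169 rows, with 5 missing "Calories" values
calories = [200.0 + i for i in range(169)]
for position in (10, 50, 90, 130, 160):  # arbitrary positions
    calories[position] = float("nan")

df = pd.DataFrame({"Duration": [60] * 169, "Calories": calories})

# info() prints its report; a buffer lets us keep the text as a string
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)

# Count the missing values, then drop the rows that contain them
missing = int(df["Calories"].isna().sum())
cleaned = df.dropna()
print(missing, len(cleaned))
```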
Viewing the Data
• Examples 1 and 2: conditional expressions let us extract specific subsets of
the data based on a defined condition.
• The output of a conditional expression (>, but ==, !=, <, <= and so on also
work) is a pandas Series of boolean values (True or False) with the same
number of rows as the original DataFrame. Such a Series can be used to filter
the DataFrame by placing it inside the selection brackets []. Only rows for
which the value is True are selected.
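The two steps described above, building the boolean Series and then using it as a filter, can be sketched with made-up data. The threshold of 45 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 30, 90],
    "Calories": [409.1, 282.4, 195.0, 520.3],
})

# Step 1: the condition produces one True/False value per row
mask = df["Duration"] > 45
print(mask)

# Step 2: placing the mask inside [] keeps only the rows marked True
long_sessions = df[mask]
print(long_sessions)
```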
Viewing the Data
• Example: select specific columns.
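Column selection might look like the sketch below. The data is made up, and the "Pulse" column is a hypothetical extra column added only to show that unselected columns are left out.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Pulse": [110, 117, 103],       # hypothetical extra column
    "Calories": [409.1, 282.4, 195.0],
})

# One column name in [] returns a Series
one_column = df["Duration"]

# A list of column names in [] returns a DataFrame
two_columns = df[["Duration", "Calories"]]

print(one_column)
print(two_columns)
```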
isin() method
• The isin() method checks whether each value is contained in the specified
list of values.
• Example 1: return rows that have the value 80 or 90 in the "Duration" column.
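A possible version of Example 1 with made-up data. isin() returns a boolean Series, which is then used as a filter in the selection brackets.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 80, 90, 45, 80],
    "Calories": [409.1, 500.0, 560.2, 282.4, 495.5],
})

# True where "Duration" is 80 or 90, False elsewhere
mask = df["Duration"].isin([80, 90])

# Keep only the matching rows
result = df[mask]
print(result)
```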
Practical section
Steps:
• Download the data set from this link:
▪ https://tinyurl.com/Sec3DS
• Import pandas.
• Load the "CardioGoodFitness.csv" file.
