Lecture 7 Understanding Dataframes in Python and R

This document provides an overview of dataFrames in Python and R, focusing on the Pandas library in Python. It explains the functionalities of Pandas for data manipulation, including creating, cleaning, and analyzing dataFrames, as well as loading data from various file formats. Additionally, it covers similar concepts in R, including data frame creation, summarization, and manipulation techniques.

Uploaded by

lovelinechiri3

Lecture 7: Understanding dataFrames in Python and R

References
• Python DataFrame: https://www.w3schools.com/python/pandas/pandas_dataframes.asp
• R DataFrame: https://www.w3schools.com/r/r_data_frames.asp

Data science is a branch of computer science in which we study how to store, use and analyze
data in order to derive information from it.

What is Pandas?
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was
created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze big data and draw conclusions based on statistical theories.
Pandas can clean messy data sets and make them readable and relevant.
Relevant data is very important in data science.

What Can Pandas Do?

Pandas gives you answers about the data, such as:

• Is there a correlation between two or more columns?
• What is the average value?
• What is the maximum value?
• What is the minimum value?

Pandas is also able to delete rows that are not relevant or that contain wrong values, such as
empty or NULL values. This is called cleaning the data.
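
The questions above map onto a few DataFrame methods. The following is a minimal sketch, assuming a small made-up data set (the column names and values are hypothetical, not taken from the lecture's data.csv):

```python
import pandas as pd

# Hypothetical sample data; None marks an empty (NULL) value.
df = pd.DataFrame({
    "duration": [60, 45, 30, 60],
    "calories": [400.0, 300.0, None, 420.0],
})

# Correlation between every pair of numeric columns.
print(df.corr())

# Average, maximum and minimum of one column.
avg_duration = df["duration"].mean()
max_duration = df["duration"].max()
min_duration = df["duration"].min()
print(avg_duration, max_duration, min_duration)

# Cleaning: drop every row that contains an empty value.
clean_df = df.dropna()
print(clean_df)
```

Note that dropna() returns a new DataFrame and leaves df unchanged.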

Where is the Pandas Codebase?

The source code for Pandas is located at this GitHub repository:
https://github.com/pandas-dev/pandas

GitHub enables many people to work on the same codebase.

Installation of Pandas
If you have Python and PIP already installed on a system, then installation of Pandas is very
easy.

Install it using this command:

C:\Users\Your Name>pip install pandas

If this command fails, use a Python distribution that already has Pandas installed, such as
Anaconda or Spyder.

Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:

import pandas

Now Pandas is imported and ready to use.

Example
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)

Pandas Series
What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Example

Create a simple Pandas Series from a list:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

If nothing else is specified, the values are labeled with their index number. First value has
index 0, second value has index 1 etc.
This label can be used to access a specified value.

Example

Return the first value of the Series:

print(myvar[0])

With the index argument, you can name your own labels.

Example

Create your own labels:

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

When you have created labels, you can access an item by referring to the label.

Example

Return the value of "y":

print(myvar["y"])

Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

Example

Create a simple Pandas Series from a dictionary:

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the index argument and specify only
the items you want to include in the Series.

Example

Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

DataFrames
Data sets in Pandas are usually two-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

Example

Create a DataFrame from two columns of data (each column is stored as a Series):

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
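
The example above passes plain lists. As a sketch of the same idea with explicit Series objects (the variable names calories and duration are chosen here for illustration):

```python
import pandas as pd

# Two Series that will become the two columns of the table.
calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each dictionary key becomes a column name; each Series becomes a column.
df = pd.DataFrame({"calories": calories, "duration": duration})
print(df)

# Selecting a single column gives back a Series.
print(type(df["calories"]))
```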

Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a
table with rows and columns.

Example

Create a simple Pandas DataFrame:

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

#load data into a DataFrame object:

df = pd.DataFrame(data)

print(df)

Result

   calories  duration
0       420        50
1       380        40
2       390        45

Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified row(s).

Example

Return row 0:

#refer to the row index:

print(df.loc[0])

Result

calories    420
duration     50
Name: 0, dtype: int64

Note: This example returns a Pandas Series.

Example

Return row 0 and 1:

#use a list of indexes:
print(df.loc[[0, 1]])

Result

   calories  duration
0       420        50
1       380        40

Note: When a list of labels is passed to loc (double brackets), the result is a Pandas DataFrame.
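
The difference between the two forms can be checked directly. This is a small sketch reusing the calories/duration data from above:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[0]         # single label: returns a Series
two_rows = df.loc[[0, 1]]   # list of labels: returns a DataFrame
wrapped = df.loc[[0]]       # one-element list: still a DataFrame

print(type(one_row).__name__)
print(type(two_rows).__name__)
print(type(wrapped).__name__)
```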

Named Indexes
With the index argument, you can name your own indexes.

Example

Add a list of names to give each row a name:

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

Result

      calories  duration
day1       420        50
day2       380        40
day3       390        45

Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

Example

Return "day2":
#refer to the named index:
print(df.loc["day2"])

Result

calories    380
duration     40
Name: day2, dtype: int64

Load Files Into a DataFrame

If your data sets are stored in a file, Pandas can load them into a DataFrame.

Example

Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)
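
If no data.csv file is at hand, the same call can be tried on CSV text held in memory. This is a sketch under that assumption; the sample rows below are made up and only mimic the column names used later in this lecture:

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of a CSV file.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,282.4
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```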

Pandas Read CSV Files

A simple way to store big data sets is to use CSV files (comma separated files).

CSV files contain plain text and are a well-known format that can be read by almost any
program, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.


Example

Load the CSV into a DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())

Tip: use to_string() to print the entire DataFrame.

If you have a large DataFrame with many rows, Pandas will only return the first 5 rows, and
the last 5 rows:

Example

Print the DataFrame without the to_string() method:

import pandas as pd
df = pd.read_csv('data.csv')
print(df)

max_rows
The number of rows returned is defined in Pandas option settings.

You can check your system's maximum rows with the pd.options.display.max_rows statement.

Example

Check the number of maximum returned rows:

import pandas as pd

print(pd.options.display.max_rows)

In my system the number is 60, which means that if the DataFrame contains more than 60
rows, the print(df) statement will return only the headers and the first and last 5 rows.
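
The truncation described above can be observed without a CSV file. The sketch below builds a hypothetical 100-row DataFrame and pins max_rows to 60 with pd.option_context, so that the result does not depend on local settings:

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, more than the limit of 60.
df = pd.DataFrame({"value": range(100)})

# Temporarily set max_rows to 60 for this block only.
with pd.option_context("display.max_rows", 60):
    short_text = repr(df)   # the text that print(df) would show

full_text = df.to_string()  # always the complete table

print(short_text)
print(len(short_text.splitlines()), "lines in the shortened view")
print(len(full_text.splitlines()), "lines in the full view")
```

The shortened view ends with a footer stating the real size of the DataFrame, which is a quick way to tell that rows have been left out.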

You can change the maximum rows number with the same statement.

Example

Increase the maximum number of rows to display the entire DataFrame:

import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df)

Pandas Read JSON
Big data sets are often stored, or extracted as JSON.

JSON is plain text, but has the format of an object, and is well known in the world of
programming, including Pandas.

In our examples we will be using a JSON file called 'data.json'.


Example

Load the JSON file into a DataFrame:

import pandas as pd

df = pd.read_json('data.json')

print(df.to_string())

Tip: use to_string() to print the entire DataFrame.

Dictionary as JSON

JSON Objects and Python Dictionaries

JSON objects have almost the same format as Python dictionaries (the main differences are
literals such as true, false and null).

If your JSON code is not in a file, but in a Python Dictionary, you can load it into a
DataFrame directly:

Example

Load a Python Dictionary into a DataFrame:

import pandas as pd

data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}

df = pd.DataFrame(data)

print(df)
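
When the JSON arrives as text rather than as a Python dictionary, it can first be converted with the standard json module. A minimal sketch, assuming a short made-up JSON string:

```python
import json
import pandas as pd

# Hypothetical JSON text, for example the body of a web response.
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# json.loads turns the JSON text into a Python dictionary.
data = json.loads(json_text)

# The dictionary can then be passed to DataFrame as in the example above.
df = pd.DataFrame(data)
print(df)
```

The outer keys become column names and the inner keys ("0", "1") become row labels, which stay strings in this case.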

Pandas - Exploring DataFrame

Viewing the Data

One of the most used methods for getting a quick overview of the DataFrame is
the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

Example

Get a quick overview by printing the first 10 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head(10))

Note: if the number of rows is not specified, the head() method will return the top 5 rows.

Example

Print the first 5 rows of the DataFrame:

import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the
bottom.

Example

Print the last 5 rows of the DataFrame:

print(df.tail())
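
Both methods can be tried on generated data. This sketch assumes a hypothetical 20-row DataFrame in place of data.csv:

```python
import pandas as pd

# Hypothetical DataFrame with the numbers 0 to 19 in one column.
df = pd.DataFrame({"n": range(20)})

first_three = df.head(3)   # rows 0, 1, 2
last_three = df.tail(3)    # rows 17, 18, 19
default_head = df.head()   # no argument: 5 rows

print(first_three)
print(last_three)
print(len(default_head))
```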

Info About the Data

The DataFrame object has a method called info() that gives you more information about the
data set.

Example

Print information about the data:

print(df.info())

Result

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None

Result Explained

The result tells us there are 169 rows and 4 columns:

RangeIndex: 169 entries, 0 to 168

Data columns (total 4 columns):

And the name of each column, with the data type:

 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64

Null Values

The info() method also tells us how many non-null values are present in each column. In our
data set, only 164 of the 169 values in the "Calories" column are non-null.

This means that 5 rows have no value at all in the "Calories" column, for whatever reason.

Empty values, or null values, can distort results when analyzing data, and you should consider
removing rows with empty values. This is a step towards what is called cleaning data.
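
The non-null counts reported by info() can also be computed directly. The following sketch uses a made-up four-row data set with two empty "Calories" values:

```python
import pandas as pd

# Hypothetical data; None marks a missing value.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30],
    "Calories": [409.1, None, 282.4, None],
})

# Number of non-null values per column (the figure info() reports).
non_null_per_column = df.count()

# Number of missing values per column.
missing_per_column = df.isna().sum()

# Keep only the rows that have a value in "Calories".
cleaned = df.dropna(subset=["Calories"])

print(non_null_per_column)
print(missing_per_column)
print(cleaned)
```

Passing subset restricts the check to the named column, so rows with gaps in other columns would be kept.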

R Data Frames
Data Frames

A data frame is data displayed in a table format.

A data frame can hold different types of data. While the first column can
be character, the second and third can be numeric or logical. However, every value within a
single column should have the same type.

Use the data.frame() function to create a data frame:

Example
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Print the data frame

Data_Frame

Summarize the Data

Use the summary() function to summarize the data from a Data Frame:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

Data_Frame

summary(Data_Frame)

Access Items
We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data frame:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

Data_Frame[1]

Data_Frame[["Training"]]

Data_Frame$Training

Add Rows
Use the rbind() function to add new rows in a Data Frame:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

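# Add a new row (built as a one-row data frame so Pulse and Duration stay numeric)

New_row_DF <- rbind(Data_Frame, data.frame(Training = "Strength", Pulse = 110, Duration = 110))

# Print the new data frame

New_row_DF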

Add Columns
Use the cbind() function to add new columns in a Data Frame:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Add a new column

New_col_DF <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))

# Print the new data frame

New_col_DF

Remove Rows and Columns

Use negative indexes, built with the c() function, to remove rows and columns from a Data Frame:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Remove the first row and column

Data_Frame_New <- Data_Frame[-c(1), -c(1)]

# Print the new data frame

Data_Frame_New
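
For comparison with the Python part of this lecture, a rough Pandas counterpart of the R example above is sketched below. It uses the drop() method, which is not covered earlier in these notes, and it relies on Pandas counting rows from 0 where R counts from 1:

```python
import pandas as pd

df = pd.DataFrame({
    "Training": ["Strength", "Stamina", "Other"],
    "Pulse": [100, 150, 120],
    "Duration": [60, 30, 45],
})

# Remove the first row (label 0) and the first column ("Training").
df_new = df.drop(index=0, columns="Training")

print(df_new)
```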

Number of Rows and Columns

Use the dim() function to find the number of rows and columns in a Data Frame:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

dim(Data_Frame)

You can also use the ncol() function to find the number of columns and nrow() to find the
number of rows:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
ncol(Data_Frame)
nrow(Data_Frame)

Data Frame Length

Use the length() function to find the number of columns in a Data Frame (similar to ncol()):

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

length(Data_Frame)

Combining Data Frames

Use the rbind() function to combine two or more data frames in R vertically:

Example
Data_Frame1 <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

Data_Frame2 <- data.frame (

Training = c("Stamina", "Stamina", "Strength"),
Pulse = c(140, 150, 160),
Duration = c(30, 30, 20)
)

New_Data_Frame <- rbind(Data_Frame1, Data_Frame2)

New_Data_Frame

And use the cbind() function to combine two or more data frames in R horizontally:

Example
Data_Frame3 <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame4 <- data.frame (
Steps = c(3000, 6000, 2000),
Calories = c(300, 400, 300)
)

New_Data_Frame1 <- cbind(Data_Frame3, Data_Frame4)

New_Data_Frame1
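
For comparison, Pandas offers one function, pd.concat(), that can play the role of both rbind() and cbind(). The sketch below mirrors the R data above; pd.concat() is not covered in the Python part of these notes, so treat this as an illustrative counterpart and not as part of the lecture's method:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Training": ["Strength", "Stamina", "Other"],
    "Pulse": [100, 150, 120],
    "Duration": [60, 30, 45],
})
df2 = pd.DataFrame({
    "Training": ["Stamina", "Stamina", "Strength"],
    "Pulse": [140, 150, 160],
    "Duration": [30, 30, 20],
})
df4 = pd.DataFrame({
    "Steps": [3000, 6000, 2000],
    "Calories": [300, 400, 300],
})

# Vertical combination (similar to rbind); ignore_index renumbers the rows.
stacked = pd.concat([df1, df2], ignore_index=True)

# Horizontal combination (similar to cbind); rows are matched by index label.
side_by_side = pd.concat([df1, df4], axis=1)

print(stacked)
print(side_by_side)
```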
