lecture-week2

The document outlines essential concepts for data analysis using Pandas, focusing on creating, saving, and examining DataFrames. It covers methods for accessing and manipulating data, including sorting, pivoting, and visualizing data. Additionally, it provides resources for further assistance and references for deeper learning in data science.

Uploaded by

trminhselflearn

Essentials for Data Analysis
SIT112 | Data Science Concepts
Lecture Week 2
Objectives
1. Create a DataFrame by reading data from a file or by using the DataFrame constructor.

2. Save a DataFrame to disk as a pickle file, and restore the DataFrame by reading the pickle file.

3. Examine the data in a DataFrame by displaying the data and its attributes.

4. Examine the data in a DataFrame by using the info(), nunique(), and describe() methods.

5. Access columns, rows, or a subset of columns and rows by using some combination of dot notation, brackets, the query() method, and the loc[] or iloc[] accessor.

6. Use Pandas methods to get statistics for the columns of a DataFrame.


Objectives (Cont.)
7. Perform calculations on the data in the columns of a DataFrame.
8. Use the replace() method to replace the data in a DataFrame.
9. Use Pandas methods to do the following DataFrame operations:
• Sort the rows
• Set and reset an index
• Pivot the data
• Melt the data
• Group and aggregate the data
• Plot the data based on the index that has been set
Introduction to the Pandas DataFrame
What is a DataFrame?
• A DataFrame (or DataFrame object) is a Pandas object that stores the data for an analysis. This object also provides the attributes and methods for working with the data.
The DataFrame Components
Component            Description
Column labels        The names at the tops of the columns.
Column data          The data in the columns. All of the data in a column typically has the same data type with one entry in each row.
Column data types    Each column has a defined data type. If all of the elements in a column don’t have the same data type, the elements are stored with the object data type.
Index                Also known as a row label. If an index isn’t defined, it is generated as a sequence of integers starting with zero.
Metadata             Attributes of the DataFrame that are generated by Pandas when the DataFrame is constructed or changed.
Reading Data into a DataFrame
• Various types of data from external sources can be read into a DataFrame.
• Commonly used Pandas functions to read data into a DataFrame:
• read_csv(): Read data from a CSV file into a DataFrame.
• read_excel(): Read data from an Excel file into a DataFrame.
• read_sql(): Read data from a SQL query or database table into a DataFrame.
• read_json(): Read data from a JSON file into a DataFrame.
• read_html(): Read data from an HTML file or web page into a list of DataFrame objects.
• read_pickle(): Read a pickled (serialized) object, such as a DataFrame, from a file.
Reading Data into a DataFrame (Cont.)
Reading a CSV File from a Website into a DataFrame
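A minimal sketch of read_csv(). The CSV text and the column names (state, year, fires) are made up for illustration, and the URL in the final comment is a placeholder rather than a real data set.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk or on a website.
csv_text = """state,year,fires
CA,2019,120
CA,2020,150
TX,2019,80
"""

# read_csv() accepts a file path, a URL string, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Reading from a website is the same call with a URL (placeholder address):
# df = pd.read_csv("https://example.com/data.csv")
```

The in-memory text is used here only so the example runs without a network connection.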
The DataFrame Constructor

Creates a new DataFrame object

The parameters of the DataFrame() constructor

• data: the data used to create the DataFrame. It can be a list, a dictionary, another DataFrame, or a NumPy array.
• columns: the column labels for the DataFrame. It can be a list or an array.
• index: the row labels for the DataFrame. It can be a list or an array.
DataFrame(data=data, columns=columns, index=index)
• Pass columns and index by keyword: when arguments are passed by position, index comes before columns.
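A short sketch of the constructor, using made-up values. It builds one DataFrame from a list of rows with explicit column and row labels, and one from a dictionary where the keys become the column labels.

```python
import pandas as pd

# A list of rows; the labels are supplied separately by keyword.
rows = [["CA", 120], ["TX", 80], ["NY", 45]]
df = pd.DataFrame(data=rows, columns=["state", "fires"], index=["a", "b", "c"])
print(df)

# A dictionary maps column labels to column data.
# No index is given, so Pandas generates 0, 1, 2, ...
df_from_dict = pd.DataFrame({"state": ["CA", "TX"], "fires": [120, 80]})
print(df_from_dict)
```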
Saving a DataFrame
• You can save a DataFrame to a file in a variety of formats using the to_ methods.
• The following are some commonly used to_ methods to save a DataFrame:
• to_csv(): Save DataFrame to a comma-separated values (csv) file.
• to_excel(): Save DataFrame to an Excel file.
• to_pickle(): Save DataFrame to a pickle file.
• to_json(): Save DataFrame to a JSON file.
• to_html(): Save DataFrame to an HTML file.
• to_sql(): Save DataFrame to a SQL database.
Saving a DataFrame (Cont.)
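A small sketch of two of the to_ methods, with made-up data. When no file path is given, to_csv() and to_json() return the text instead of writing a file, which keeps the example self-contained. The file name in the final comment is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"state": ["CA", "TX"], "fires": [120, 80]})

# index=False leaves the row labels out of the CSV output.
csv_text = df.to_csv(index=False)
print(csv_text)

# orient="records" writes one JSON object per row.
json_text = df.to_json(orient="records")
print(json_text)

# Passing a path writes to disk instead (hypothetical file name):
# df.to_csv("fires.csv", index=False)
```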
Pickle Files
• A pickle file is a binary file that contains serialized Python objects. Most Python objects can be pickled, including lists, dictionaries, and instances of classes; functions and classes are pickled by reference (by name), not by value.

• The pickle module in Python provides a way to serialize and de-serialize objects to and from a file.

• Pickling is the process of converting a Python object into a byte stream, and unpickling is the inverse operation of loading the byte stream back into a Python object.
Pickle Files (Cont.)
• The pickle module is used for various purposes, such as:
• Saving and loading machine learning models
• Saving and loading large data structures
• Caching intermediate results to disk
Saving a DataFrame into a Pickle File
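A minimal sketch of saving a DataFrame to a pickle file and restoring it. The data and the file name fires.pkl are made up, and a temporary directory is used so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"state": ["CA", "TX"], "fires": [120, 80]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fires.pkl")  # hypothetical file name
    df.to_pickle(path)               # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # read it back into a new DataFrame

# The restored DataFrame has the same data, labels, and data types.
print(restored.equals(df))  # True
```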
How to Examine the Data
How to examine data
● Display a DataFrame
○ Display the first few rows of a DataFrame
○ Display the last few rows of a DataFrame
○ Display data in 5 rows and all columns
● Attributes of the DataFrame object
○ Display the attributes of the DataFrame object
○ Use the columns attribute to remove spaces from the column names
● The info(), nunique() and describe() methods
○ Using the info() method
○ Using the nunique() method
○ Using the describe() method
Display a DataFrame
Display the First Few Rows of a DataFrame
Display the Last Few Rows of a DataFrame
Display the Data in 5 Rows and All Columns
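A sketch of the three display techniques named above, on a made-up 20-row DataFrame. Using option_context() for the "5 rows and all columns" case is one possible approach and an assumption here, since the original slide code is not available.

```python
import pandas as pd

df = pd.DataFrame({"id": range(1, 21), "value": [i * 10 for i in range(1, 21)]})

first_rows = df.head()   # first 5 rows by default
last_three = df.tail(3)  # last 3 rows
print(first_rows)
print(last_three)

# Temporarily limit the printed output to 5 rows and lift the column limit.
with pd.option_context("display.max_rows", 5, "display.max_columns", None):
    print(df)
```

The display options only change how the DataFrame is printed; the data itself is untouched.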
Some of the Attributes of a DataFrame Object

Attribute    Description
values       The values of the DataFrame in an array format
index        The row index
columns      The column names
size         The total number of elements
shape        The number of rows and columns
Display the Attributes of a DataFrame Object
Use the columns Attribute to Remove Spaces from the Column Names
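A short sketch of the attributes listed above, followed by one way to strip the spaces out of column names. The column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"fire name": ["A", "B"], "acres burned": [10.5, 20.0]})

print(df.values)   # the data as a NumPy array
print(df.index)    # the row labels
print(df.columns)  # the column labels
print(df.size)     # total number of elements: 4
print(df.shape)    # (rows, columns): (2, 2)

# Assign new labels to the columns attribute, with every space removed.
df.columns = df.columns.str.replace(" ", "", regex=False)
print(df.columns.tolist())  # ['firename', 'acresburned']
```

Column names without spaces can be used with dot notation and are easier to reference in query() strings.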
The info(), nunique(), and describe() Methods

Method          Description
info(params)    Prints information about the DataFrame and its columns (index, column names, non-null counts, data types, and memory usage).
nunique()       Returns the number of unique data items in each column.
describe()      Returns statistical information for each numeric column (the default behavior).


Using the info() Method
Using the nunique() Method
Using the describe() Method
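A minimal sketch of the three methods on a small, made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "year": [2019, 2020, 2019, 2020],
    "fires": [120, 150, 80, 45],
})

# info() prints its summary directly and returns None.
df.info()

# nunique() returns a Series with one count per column.
unique_counts = df.nunique()
print(unique_counts)

# describe() returns a DataFrame of count, mean, std, min, quartiles, and max.
summary = df.describe()
print(summary)
```

By default describe() skips the non-numeric state column.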
How to Access the Columns and Rows
Accessing rows and columns
● Accessing columns with dot notation
● Accessing columns with brackets
● Accessing columns using loc[]
● Accessing rows using query() based on one column
● Accessing rows using query() based on multiple columns with AND
● Accessing rows using query() based on multiple columns with OR
● Accessing rows using loc[]
● Access a subset of rows and columns using query()
● Access rows and columns using loc[] and iloc[]
○ Difference between loc[] and iloc[]
Accessing Columns with Dot Notation
Accessing Columns with Brackets
Accessing Columns using loc[]
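A sketch of the three ways to access columns, using made-up data. Note that loc is an accessor used with square brackets, not a method called with parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "year": [2019, 2020, 2019, 2020],
    "fires": [120, 150, 80, 45],
})

# Dot notation: works only when the column name is a valid Python identifier.
fires_dot = df.fires

# Brackets: one label returns a Series, a list of labels returns a DataFrame.
fires_series = df["fires"]
two_columns = df[["state", "fires"]]

# loc[]: the colon selects all rows, the list selects the columns.
two_columns_loc = df.loc[:, ["state", "fires"]]
print(two_columns_loc)
```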
Accessing Rows using query() based on One Column
Accessing Rows using query() based on Multiple Columns with AND
Accessing Rows using query() based on Multiple Columns with OR
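A sketch of the three query() cases above, with made-up data. The condition is written as a string, and string values inside it need their own quotes.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "year": [2019, 2020, 2019, 2020],
    "fires": [120, 150, 80, 45],
})

# One column.
ca_rows = df.query("state == 'CA'")

# Multiple columns combined with AND: both conditions must hold.
ca_large = df.query("state == 'CA' and fires > 130")

# Multiple columns combined with OR: either condition may hold.
tx_or_large = df.query("state == 'TX' or fires > 130")

print(ca_rows)
print(ca_large)
print(tx_or_large)
```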
Accessing Rows using loc[]
Accessing Rows using loc[] (Cont.)
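A sketch of row access by label and by condition, using made-up data with the default integer index.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "fires": [120, 150, 80, 45],
})

first_row = df.loc[0]           # a single label returns the row as a Series
picked_rows = df.loc[[0, 2]]    # a list of labels returns a DataFrame
label_slice = df.loc[1:2]       # a label slice includes both end points
large = df.loc[df["fires"] > 100]  # a Boolean condition filters the rows

print(first_row)
print(label_slice)
print(large)
```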
Access a Subset of Rows and Columns using query()
Access a Subset of Rows and Columns using query() (Cont.)
Access a Subset of Rows and Columns using loc[]
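A sketch of selecting rows and columns together, with made-up data. query() filters the rows and brackets then pick the columns; loc[] does both in a single step.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "year": [2019, 2020, 2019, 2020],
    "fires": [120, 150, 80, 45],
})

# query() for the rows, then a list of column labels.
subset_query = df.query("fires > 100")[["state", "fires"]]

# loc[rows, columns]: a row condition and a list of column labels.
subset_loc = df.loc[df["fires"] > 100, ["state", "fires"]]

print(subset_query)
print(subset_loc)
```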
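Access Rows and Columns using iloc[]
What’s the difference between loc[] and iloc[]?
• loc[] selects rows and columns by label (index values and column names). A label slice includes its end label.
• iloc[] selects rows and columns by integer position, starting at zero. A position slice excludes its end position.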
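A short sketch contrasting the two accessors: loc[] works with labels and iloc[] works with integer positions. The data is made up, and a non-default string index is used so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["CA", "TX", "NY"], "fires": [120, 80, 45]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "fires"]  # row label "b", column label "fires"
by_position = df.iloc[1, 1]      # second row, second column
print(by_label, by_position)     # 80 80

label_slice = df.loc["a":"b"]    # label slice: end label included, 2 rows
position_slice = df.iloc[0:1]    # position slice: end excluded, 1 row
print(len(label_slice), len(position_slice))  # 2 1
```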
How to Prepare the Data
Preparing Data
● Sorting the data
● Applying statistical methods
● The quantile() method
● Column arithmetic
● Modifying the string data in a column
Sorting the Data
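A sketch of sorting rows with sort_values(), using made-up data. The method returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "fires": [120, 150, 80, 45],
})

# Sort by one column, largest values first.
by_fires = df.sort_values("fires", ascending=False)

# Sort by state (A to Z), then by fires (largest first) within each state.
by_state_then_fires = df.sort_values(["state", "fires"], ascending=[True, False])

print(by_fires)
print(by_state_then_fires)
```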
Applying Statistical Methods
Applying Statistical Methods (Cont.)
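A sketch of a few statistical methods on made-up numeric data. Called on a DataFrame they return one value per column; called on a single column they return one value.

```python
import pandas as pd

df = pd.DataFrame({
    "fires": [120, 150, 80, 45],
    "acres": [10.0, 30.0, 5.0, 15.0],
})

column_means = df.mean()           # one mean per column
total_fires = df["fires"].sum()    # 395
max_fires = df["fires"].max()      # 150
median_fires = df["fires"].median()  # 100.0

print(column_means)
print(total_fires, max_fires, median_fires)
```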
The quantile() Method
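A sketch of quantile() on a made-up column. A quantile of 0.5 is the median, and a list of quantiles returns a Series indexed by those quantiles. Linear interpolation between data points is the default.

```python
import pandas as pd

df = pd.DataFrame({"fires": [10, 20, 30, 40, 50]})

median = df["fires"].quantile(0.5)
quartiles = df["fires"].quantile([0.25, 0.75])

print(median)     # 30.0
print(quartiles)  # 0.25 -> 20.0, 0.75 -> 40.0
```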
Column Arithmetic
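A sketch of column arithmetic with made-up data. Operations are applied row by row, and assigning the result to a new label adds a column.

```python
import pandas as pd

df = pd.DataFrame({
    "fires": [100, 50, 50],
    "acres": [1000, 250, 750],
})

# Divide one column by another, element by element.
df["acres_per_fire"] = df["acres"] / df["fires"]

# Combine a column with a single value (the column total).
df["fires_pct"] = df["fires"] / df["fires"].sum() * 100

print(df)
```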
Modifying the String Data in a Column
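A sketch of cleaning string data with the .str accessor, plus the replace() method from the objectives. The messy values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  camp fire", "TUBBS fire ", "Thomas Fire"],
    "state": ["Calif", "CA", "CA"],
})

# Trim whitespace, then capitalize each word.
df["name"] = df["name"].str.strip().str.title()

# Remove a literal substring from every value.
df["name"] = df["name"].str.replace(" Fire", "", regex=False)

# replace() swaps whole values; the dictionary maps old values to new ones.
df["state"] = df["state"].replace({"Calif": "CA"})

print(df)
```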
How to Shape the Data
Setting and Using an Index
Setting and Using an Index (Cont.)
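A sketch of setting, using, and resetting an index, with made-up data. set_index() turns one or more columns into row labels, and reset_index() turns them back into ordinary columns.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "year": [2019, 2020, 2019, 2020],
    "fires": [120, 150, 80, 45],
})

# One column as the index.
by_state = df.set_index("state")
tx_fires = by_state.loc["TX", "fires"]

# Two columns as a multi-level index; rows are then addressed with a tuple.
by_state_year = df.set_index(["state", "year"])
ca_2020_fires = by_state_year.loc[("CA", 2020), "fires"]

# Move the index levels back into columns.
restored = by_state_year.reset_index()

print(tx_fires, ca_2020_fires)
print(restored)
```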
Pivoting the Data
• Pivoting transforms rows into columns, or vice versa, in a dataset.
• It restructures the data to make it more useful for analysis or reporting purposes.
• To pivot data, you typically identify a column whose values become the new column headers, and a second column that contains the values to be placed under each new header. A third column usually supplies the row labels (the index).
Pivoting the Data (Cont.)
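A sketch of pivot() on made-up long-format data. The year values become the new column headers, the fires values fill the cells, and state supplies the row labels.

```python
import pandas as pd

long_df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "year": [2019, 2020, 2019, 2020],
    "fires": [120, 150, 80, 95],
})

wide = long_df.pivot(index="state", columns="year", values="fires")
print(wide)
```

pivot() requires each index and column combination to appear only once; pivot_table() is the usual alternative when duplicates need to be aggregated.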
Melting the Data
• Melting transforms a dataset from a wide format to a long format.
• It typically involves taking columns of data and converting them into rows.
• It can be useful for data analysis and visualization.
Melting the Data (Cont.)
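A sketch of melt() on made-up wide-format data. The id_vars column is kept as is, and every other column is turned into a pair of rows: its name goes into one column and its values into another.

```python
import pandas as pd

wide = pd.DataFrame({
    "state": ["CA", "TX"],
    "fires_2019": [120, 80],
    "fires_2020": [150, 95],
})

long_df = wide.melt(id_vars="state", var_name="measure", value_name="fires")
print(long_df)
```

The names measure and fires are illustrative choices for var_name and value_name.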
Analyze the Data
Group the Data
Group the Data (Cont.)
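A sketch of grouping with groupby(), using made-up data. Rows that share a value in the grouping column are collected together, and a method such as sum() is then applied to each group.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX", "NY"],
    "year": [2019, 2020, 2019, 2020],
    "fires": [120, 150, 80, 45],
})

# The group labels become the index of the result, sorted by default.
fires_by_state = df.groupby("state")["fires"].sum()
print(fires_by_state)

# Grouping by two columns produces a two-level index.
fires_by_state_year = df.groupby(["state", "year"])["fires"].sum()
print(fires_by_state_year)
```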
Aggregate the Data
Aggregate the Data (Cont.)
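A sketch of agg(), which applies several statistics to grouped data at once. The data is made up, and the output column names total_fires and max_acres are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "fires": [120, 150, 80],
    "acres": [10.0, 30.0, 5.0],
})

# Several statistics for one column: one output column per statistic.
fire_stats = df.groupby("state")["fires"].agg(["sum", "mean", "max"])
print(fire_stats)

# Named aggregation: new_name=(source_column, statistic).
summary = df.groupby("state").agg(
    total_fires=("fires", "sum"),
    max_acres=("acres", "max"),
)
print(summary)
```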
Visualize the Data
Visualize the Data: Line Plot
Visualize the Data: Bar Chart
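A sketch of a line plot and a bar chart drawn from a DataFrame whose index has been set, so the index values label the x-axis. The data is made up. The Agg backend is an assumption that lets the script run without a display; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)

import pandas as pd

df = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "fires": [100, 120, 150],
}).set_index("year")

# Each call draws on a new figure and returns its Matplotlib Axes object.
line_ax = df.plot.line(y="fires", title="Fires per year")
bar_ax = df.plot.bar(y="fires", title="Fires per year")

# Save the bar chart to a file if needed (hypothetical file name):
# bar_ax.figure.savefig("fires_bar.png")
```

Plotting through Pandas requires Matplotlib to be installed.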
Additional Help: HelpHub Sessions
• If you need assistance with your programming skills, please use the HelpHub sessions as listed on CloudDeakin.
• You can ask programming questions and get limited support on the programming side of the tasks.
• Please do not use HelpHub as a replacement for the Workshops.
Additional Help: Math Help
If you need help with math, please use the Maths Mentors Drop-in sessions: the Maths Mentors are
available Monday to Friday, 10 am – 2 pm through the Zoom Maths Mentor Online Drop-in or email
[email protected] anytime and the mentors will respond when they are next working.
References
• Data Science from Scratch: First Principles with Python, Joel Grus, O'Reilly Media, 2019.
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Wes McKinney, O'Reilly Media, 3rd edition, 2022.
• Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas, O'Reilly Media, 2022.
• Murach’s Python for Data Analysis, Scott McCoy, Mike Murach & Associates, Incorporated, 2021.
• Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud, Paul Deitel, Pearson Education Limited, 2021.
• ChatGPT
End of lecture …
