lecture-week2
Data Analysis
SIT112 | Data Science Concepts
Lecture Week 2
Objectives
1. Create a DataFrame by reading data from a file or by using the DataFrame constructor.
2. Save a DataFrame to disk as a pickle file, and restore the DataFrame by reading the pickle file.
3. Examine the data in a DataFrame by displaying the data and its attributes.
4. Examine the data in a DataFrame by using the info(), nunique(), and describe() methods.
5. Access columns, rows, or a subset of columns and rows by using some combination of dot
notation, brackets, the query() method, and the loc[] or iloc[] accessor.
Column data types: Each column has a defined data type. If all of the elements in a column don’t
have the same data type, the elements are stored with the object data type.
Metadata: Attributes of the DataFrame that are generated by Pandas when the
DataFrame is constructed or changed.
Reading Data into a DataFrame
• Various types of data from external sources can be read into a DataFrame.
• Commonly used functions (methods) to read data into a DataFrame:
• read_csv(): Read data from a CSV file into a DataFrame.
• read_excel(): Read data from an Excel file into a DataFrame.
• read_sql(): Read data from a SQL query or database table into a DataFrame.
• read_json(): Read data from a JSON file into a DataFrame.
• read_html(): Read data from an HTML file or web page into a list of DataFrame objects.
• read_pickle(): Read a pickled object (serialized object) from a file into a DataFrame.
Reading Data into a DataFrame (Cont.)
Reading a CSV File from a Website into a DataFrame
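The following is a minimal sketch of reading CSV data with read_csv(). The column names and values are made up for illustration, and an in-memory buffer stands in for the website so the example runs offline. With a real site you would pass the URL string to read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file hosted on a website.
csv_text = "name,units,price\nwidget,3,2.50\ngadget,5,4.00\n"

# read_csv() accepts a file path, a URL string, or a file-like object.
# For a real website, pass the URL, for example:
#   pd.read_csv("https://example.com/data.csv")   # placeholder URL
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
```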
The DataFrame Constructor
• data: specifies the data used to create the DataFrame. It can be a
list, a dictionary, another DataFrame, or a NumPy array.
• columns: specifies the column labels for the DataFrame. It can be a list or an
array.
• index: specifies the row labels for the DataFrame. It can be a list or an array.
DataFrame(data, index=index, columns=columns)
Pass index and columns as keyword arguments: positionally, index comes before columns.
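As a small illustration of the constructor, the sketch below builds the same DataFrame twice: once from a list of rows and once from a dictionary of columns. The labels and values are hypothetical.

```python
import pandas as pd

# From a list of rows: column and row labels are supplied by keyword.
rows = [[3, 2.5], [5, 4.0]]
df = pd.DataFrame(data=rows, columns=["units", "price"], index=["widget", "gadget"])

# From a dictionary: the keys become the column labels.
df2 = pd.DataFrame({"units": [3, 5], "price": [2.5, 4.0]}, index=["widget", "gadget"])

print(df)
print(df2)
```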
Saving a DataFrame
• You can save a DataFrame to a file in a variety of formats using the to_
methods.
• The following are some commonly used to_ methods to save a
DataFrame:
• to_csv(): Save DataFrame to a comma-separated values (csv) file.
• to_excel(): Save DataFrame to an Excel file.
• to_pickle(): Save DataFrame to a pickle file.
• to_json(): Save DataFrame to a JSON file.
• to_html(): Save DataFrame to an HTML file.
• to_sql(): Save DataFrame to a SQL database.
Saving a DataFrame (Cont.)
Pickle Files
• A pickle file is a binary file that contains serialized Python objects. Most
Python objects can be pickled, including lists, dictionaries, DataFrames, and
instances of classes. Functions and classes are stored by reference, not by value.
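A minimal sketch of the save-and-restore round trip with to_pickle() and read_pickle(). The file name sales.pkl and the data are assumptions for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"units": [3, 5], "price": [2.5, 4.0]}, index=["widget", "gadget"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sales.pkl")  # hypothetical file name
    df.to_pickle(path)                # save the DataFrame to disk
    restored = pd.read_pickle(path)   # read it back into a new DataFrame

# The restored DataFrame keeps its data, index, and column data types.
print(restored.equals(df))
```

Only read pickle files from sources you trust, because unpickling can execute arbitrary code.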
Attribute   Description
values      The values of the DataFrame in an array format
index       The row index
columns     The column names
size        The total number of elements
shape       The number of rows and columns
Display the Attributes of a DataFrame Object
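The attributes can be displayed as in the sketch below, which uses a small made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"units": [3, 5, 2], "price": [2.5, 4.0, 1.0]})

print(df.values)   # the data as a NumPy array
print(df.index)    # the row index
print(df.columns)  # the column names
print(df.size)     # total number of elements (rows * columns)
print(df.shape)    # (number of rows, number of columns)
```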
Use the columns Attribute to Remove Spaces from Column Names
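One way to do this, sketched with hypothetical column names, is to apply a string replacement to the columns attribute and assign the result back.

```python
import pandas as pd

df = pd.DataFrame({"unit price": [2.5, 4.0], "units sold": [3, 5]})

# Replace each space in the column names with an empty string.
df.columns = df.columns.str.replace(" ", "", regex=False)

print(df.columns.tolist())
```

Names without spaces can also be used with dot notation, for example df.unitprice.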
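The info(), nunique(), and describe() Methods
Method         Description
info(params)   Prints information about the DataFrame and its columns.
nunique()      Returns the number of unique values in each column.
describe()     Returns summary statistics for the numeric columns.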
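The sketch below calls the three methods named in the slide title on a small made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "units": [3, 5, 2, 5],
})

df.info()                      # prints column names, non-null counts, and data types
unique_counts = df.nunique()   # number of unique values in each column
summary = df.describe()        # count, mean, std, min, quartiles, max for numeric columns

print(unique_counts)
print(summary)
```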
Ask ChatGPT ☺
What’s the difference between loc[] and
iloc[]?
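As a starting point for that question, here is a minimal sketch of the access styles listed in the objectives: dot notation, brackets, query(), loc[], and iloc[]. The data and labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"region": ["north", "south", "north", "east"], "units": [3, 5, 2, 4]},
    index=["a", "b", "c", "d"],
)

units_dot = df.units                 # dot notation: one column
units_bracket = df["units"]          # brackets: one column
subset = df[["region", "units"]]     # brackets with a list: several columns
big = df.query("units > 2")          # query(): rows that match a condition

by_label = df.loc["b":"c", "units"]  # loc[]: labels, end label is included
by_position = df.iloc[1:3, 1]        # iloc[]: integer positions, end is excluded

print(by_label)
print(by_position)
```

Both slices select the same two rows here: loc[] uses the row labels "b" to "c", while iloc[] uses positions 1 and 2.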
How to Prepare the Data
Preparing Data
● Sorting the data
● Applying statistical methods
● The quantile method
● Column arithmetic
● Modifying the string data in a column
Sorting the Data
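A minimal sketch of sorting with sort_values(), using made-up data: first by one column in descending order, then by two columns with a different direction for each.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "units": [3, 5, 2, 4],
})

# Sort by one column, largest values first.
by_units = df.sort_values("units", ascending=False)

# Sort by region (A to Z), then by units (largest first) within each region.
by_region_units = df.sort_values(["region", "units"], ascending=[True, False])

print(by_units)
print(by_region_units)
```

sort_values() returns a new DataFrame and leaves the original unchanged.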
Applying Statistical Methods
Applying Statistical Methods (Cont.)
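The sketch below applies a few common statistical methods to a single column and to the whole DataFrame. The data is made up; numeric_only=True is passed so the text column is skipped.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "units": [3, 5, 2, 4],
    "price": [2.5, 4.0, 1.0, 3.0],
})

# Statistics for a single column.
mean_units = df["units"].mean()
median_units = df["units"].median()
max_units = df["units"].max()
total_price = df["price"].sum()

# Statistics for every numeric column at once.
means = df.mean(numeric_only=True)

print(mean_units, median_units, max_units, total_price)
print(means)
```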
The quantile() Method
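A short sketch of quantile() on a made-up column. A single number returns one value, and a list of numbers returns a Series with one value per quantile. By default pandas interpolates linearly between data points.

```python
import pandas as pd

units = pd.Series([2, 3, 4, 5], name="units")

median = units.quantile(0.5)             # the 50th percentile
quartiles = units.quantile([0.25, 0.75]) # the 25th and 75th percentiles

print(median)
print(quartiles)
```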
Column Arithmetic
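Column arithmetic works element by element, as in this sketch with hypothetical units and price columns. The result is stored in a new column.

```python
import pandas as pd

df = pd.DataFrame({"units": [3, 5, 2], "price": [2.5, 4.0, 1.0]})

# Multiply two columns row by row and store the result in a new column.
df["revenue"] = df["units"] * df["price"]

# Arithmetic with a single number applies to every row.
df["price_with_tax"] = df["price"] * 2

print(df)
```

The multiplier 2 is only a placeholder chosen to keep the numbers simple.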
Modifying the String Data in a Column
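String data in a column can be modified through the str accessor. The sketch below, on made-up values, removes surrounding whitespace and standardizes the capitalization.

```python
import pandas as pd

df = pd.DataFrame({"region": ["  north ", "SOUTH", "east coast"]})

# Chain string methods: strip whitespace, then capitalize each word.
df["region"] = df["region"].str.strip().str.title()

# Replace part of each string and store the result in a new column.
df["region_code"] = df["region"].str.replace(" ", "_", regex=False)

print(df)
```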
How to Shape the Data
Setting and Using an Index
Setting and Using an Index (Cont.)
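A minimal sketch, assuming a hypothetical region column with unique values: set_index() turns the column into the row index, loc[] then looks rows up by label, and reset_index() turns the index back into a column.

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east"], "units": [3, 5, 4]})

indexed = df.set_index("region")          # region values become the row labels
south_units = indexed.loc["south", "units"]  # look up a row by its label

restored = indexed.reset_index()          # move the index back into a column

print(indexed)
print(south_units)
```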
Pivoting the Data
• Transforming rows into columns, or vice versa, in a dataset.
• To restructure data to make it more useful for analysis or reporting purposes.
• To pivot data, you typically identify a column of interest that you want to
become the new column headers, and a second column that contains the values
that should be placed under each new header.
Pivoting the Data (Cont.)
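The pivoting step described above can be sketched with pivot(). In this made-up data, the year column supplies the new column headers and the units column supplies the values placed under them.

```python
import pandas as pd

long_df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2022, 2023, 2022, 2023],
    "units": [3, 4, 5, 6],
})

# One row per region, one column per year, units as the cell values.
wide_df = long_df.pivot(index="region", columns="year", values="units")

print(wide_df)
```

pivot() requires each index and column combination to appear only once in the data.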
Melting the Data
• The process of transforming a dataset from a wide format to a long format.
• Typically involves taking columns of data and converting them into rows.
• Can be useful for data analysis and visualization.
Melting the Data (Cont.)
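Melting can be sketched with melt(). In this made-up wide table, the two year columns are converted into rows, with one row per region and year.

```python
import pandas as pd

wide_df = pd.DataFrame({
    "region": ["north", "south"],
    "2022": [3, 5],
    "2023": [4, 6],
})

# Keep region as an identifier and stack the year columns into rows.
long_df = wide_df.melt(id_vars="region", var_name="year", value_name="units")

print(long_df)
```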
Analyze the Data
Group the Data
Group the Data (Cont.)
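A minimal grouping sketch with groupby(), using made-up data: the rows are grouped by region and the units in each group are added up.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "units": [3, 5, 2, 4],
})

# Group the rows by region, then total the units in each group.
totals = df.groupby("region")["units"].sum()

print(totals)
```

The group labels become the index of the result and are sorted by default.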
Aggregate the Data
Aggregate the Data (Cont.)
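Several statistics can be computed per group in one step with agg(), as in this sketch on made-up data.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "units": [3, 5, 2, 4],
})

# Compute the sum, mean, and maximum of units for each region.
stats = df.groupby("region")["units"].agg(["sum", "mean", "max"])

print(stats)
```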
Visualize the Data
Visualize the Data: Line Plot
Visualize the Data: Bar Chart
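The sketch below draws a line plot and a bar chart from the same made-up DataFrame using the pandas plot interface, which relies on Matplotlib. The non-interactive "Agg" backend is an assumption made so the example runs without a display. In a notebook you can omit that line and the plots appear inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"year": [2021, 2022, 2023], "units": [3, 5, 4]})

# Two side-by-side axes: one for each chart type.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

df.plot.line(x="year", y="units", ax=ax_line, title="Units per year (line)")
df.plot.bar(x="year", y="units", ax=ax_bar, title="Units per year (bar)")

fig.tight_layout()
```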
Additional Help: HelpHub Sessions
∙ If you need assistance with your programming skills, please use the HelpHub sessions as listed
on CloudDeakin.
∙ You can ask programming questions and get limited support on the programming side of the
tasks.
∙ Please do not use HelpHub as a replacement for the Workshops.
Additional Help: Math Help
If you need help with math, please use the Maths Mentors Drop-in sessions. The Maths Mentors are
available Monday to Friday, 10 am – 2 pm, through the Zoom Maths Mentor Online Drop-in. You can also email
[email protected] at any time, and the mentors will respond when they are next working.
References
• Data Science from Scratch: First Principles with Python, Joel Grus, O'Reilly Media, 2019.
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Wes McKinney, O'Reilly
Media, 3rd edition, 2022.
• Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas, O'Reilly Media,
2022.
• Murach’s Python for Data Analysis, Scott McCoy, Mike Murach & Associates, Incorporated, 2021.
• Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The
Cloud, Paul Deitel, Pearson Education Limited, 2021.
• ChatGPT
End of lecture …