lecture-week1
Data Science
SIT112 | Data Science Concepts
Lecture Week 1
Contents
• What is Data Science?
• What You Need to Do Before Starting an Analysis
• The 5 Phases of Data Analysis and Visualization
• Getting Started with Python
• Using JupyterLab
The Unit Outline
Unit Chair
Dr. Davoud Mougouei
Academic Roles:
• Senior Lecturer @ School of IT, Deakin University
• Lecturer @ School of Computing and IT, University of Wollongong
• Lecturer @ School of Mathematics, Physics, and Computing, UniSQ
• Postdoctoral Research Fellow @ Faculty of IT, Monash University
Unit Chair (Cont.)
• https://www.researchgate.net/profile/Davoud-Mougouei
• http://globalaffects.org/
• https://ieeexplore.ieee.org/document/9709267?source=authoralert
• https://dl.acm.org/doi/abs/10.1145/3236024.3264843
• https://onlinelibrary.wiley.com/doi/abs/10.1002/hbe2.304
Teaching Team
• The tutor will introduce the students to the workshop and recap on the contents of the previous week.
• The students will work on the activities/tasks while the tutor oversees them and answers their questions.
• The tutor will provide more guidance or walk the students through the solutions when appropriate.
• The students will continue to work on the activities/tasks while the tutor oversees them and answers their questions.
Workshops: One-To-One Online Sessions
• Don’t be shy to ask for help!
• It is okay to not know, but it is not
okay to not ask ☺
• One-to-One Online Sessions
(breakouts) can be arranged
during the Online workshops; ask
your tutor!
• Use this responsibly as other
students might need help too ☺
Workshops: Dos and Don’ts!
• Please avoid emailing code or screenshots of your code to the teaching team
outside workshop hours; instead, demonstrate your solutions (code/report) to
the tutors during the workshops.
• The tutors can help you fix your code in a one-to-one discussion. They can also
show you how to use external resources (e.g., ChatGPT) to fix your code while
improving your programming/problem-solving skills.
• Please bring your own Windows/Mac/Linux laptop to the workshop (do not
use tablets or Chromebooks to complete the tasks).
Assessment: Tasks
Task | Definition
Pass Tasks P1-P8 | To achieve the minimum acceptable standard for this unit. To complete the Pass tasks, students should be able to comprehend and execute Data Science solutions implemented in Python and write code for basic problems with guidance. End of the Unit Assessment (20%) is one of the Pass tasks.
Credit Tasks C1-C2 | Students will apply what they have learnt in the Pass tasks with less guidance. To complete the Credit tasks, students should be able to comprehend and execute Data Science solutions implemented in Python and write code for basic problems with limited guidance.
Distinction Task D1 | Students will apply their advanced knowledge to design and build solutions to a real-world scenario. To complete the Distinction task, students should be able to comprehend and execute Data Science solutions implemented in Python and write code for moderately complex problems, independently.
High Distinction HD1 | Students will extend their understanding to demonstrate greater technical ability in developing more complex solutions to a real-world scenario. To complete the High Distinction task, students should be able to comprehend and execute Data Science solutions implemented in Python and write code for complex problems, independently.
Assessment: Set a Target Grade
Target Grade | Minimum Requirements
Pass | All the Pass tasks are completed. Failure to complete any of the Pass tasks will result in a Fail grade.
Credit | Minimum requirements for a Pass grade are met AND all Credit tasks are completed.
Distinction | Minimum requirements for a Credit grade are met AND the Distinction task is completed. In addition, the student must create a video recording presenting their completed task; they may be required to answer questions or make changes to their code.
High Distinction | Minimum requirements for a Distinction grade are met AND all High Distinction tasks are completed. Interviews are required. The students might be asked questions about their submissions, and they may be required to complete small tasks during the interviews.
Assessment: Tasks - OnTrack
Assessment: Complete and Submit the Tasks
• Knowledge and skills in this unit build continuously on those learnt in earlier weeks. If you fall behind,
the subsequent content becomes difficult to understand, so try to submit your tasks by 11 am on
the due date. If you miss a due date, you can still submit your task by the end of Week 12 (the Deadline).
However, only submissions made by 11 am on the due date will receive feedback (via OnTrack).
• Having said that, you can still ask for help with overdue tasks during the workshops, although priority goes
to tasks that are current (released and not yet due).
• Before completing any task, please read the instructions in the task description and task completion form;
submit via OnTrack: https://ontrack.deakin.edu.au/. Please note that Task P8 (End of Unit Assessment)
will be submitted via CloudDeakin; you don’t need to submit P8 via OnTrack. More information about the End
of Unit Assessment will be provided later.
Assessment: Submission Items
Task | Submission Items
Pass Tasks P1-P8
• P1: Sign and submit the assessment guideline via OnTrack.
• P2-P7: Submit the task completion report (PDF file) via OnTrack.
• P8: Submit the End of the Unit Assessment (EoUA) via CloudDeakin.
Credit Tasks C1-C2
• Submit the task completion report (PDF file) via OnTrack.
• Submit the Jupyter Notebook (ipynb file) via OnTrack.
Distinction Task D1
• Submit the task completion report (PDF file) via OnTrack; a link to the video recording must be included in the task completion report.
• Submit the Jupyter Notebook (ipynb file) via OnTrack.
High Distinction Task HD1
• Submit the task completion report (PDF file) via OnTrack; a link to the video recording must be included in the task completion report.
• Submit the Jupyter Notebook (ipynb file) via OnTrack.
Task Completion Report
Assessment: Feedback
Feedback | Meaning | Required Action
Complete | The submission meets the essential requirements of the task and is ready for inclusion in the portfolio. | No further action is required.
Discuss | The tutor would like to discuss the submission with the student. | Respond to the tutor’s questions via OnTrack.
Demonstrate | The tutor would like the student to demonstrate the submission. | Meet with the tutor (online/on-campus) to demonstrate your submission.
Fix and Resubmit | The submission needs to be improved or fixed. | Fix your submission and resubmit. A maximum of 2 resubmissions is allowed per task, but only the first resubmission will receive further feedback, and only if it is received within 7 of the initial feedback. The 2nd resubmission can be made anytime by the end of Week 12 (the Deadline) with no feedback.
Fail | The submission has failed to meet the essential requirements of the task. | No action is required.
You are not allowed to make more than 3 submissions per task (original submission plus a maximum of 2 resubmissions).
If you mistakenly exceed this limit, please contact your tutor; they will help you fix the issue.
Assessment: Portfolio submission
• By the end of Week 12, you will need to submit your final portfolio including all your completed tasks.
• Please note that Task P8 (End of Unit Assessment) will be submitted via CloudDeakin; you don’t need to
submit P8 via OnTrack. More information about End of Unit Assessment will be provided later.
Assessment: End of Unit Assessment
• Non-Technical Questions:
• Directly email the examiner: [email protected]
• Your subject line must contain: UnitCode-StudentID-Subject
• You can expect an answer within 2 working days.
Discussions
• During the Workshops: we may allocate some
time to discussion.
• Outside the Workshops: you can continue to
discuss on the student forum.
• The student forum is there to encourage discussion
among the students; it is not frequently
monitored by the teaching team.
Announcements
URL: https://doi.org/10.6028/NIST.SP.1500-1r2
What is Data Science?
NIST’s definition: “Data science is the methodology for the synthesis of useful knowledge
directly from data through a process of discovery or of hypothesis formulation and hypothesis
testing.”
Who is a Data Scientist?
NIST definition: “A data scientist is a practitioner who has sufficient knowledge in the overlapping
regimes of business needs, domain knowledge, analytical skills, and software and systems engineering
to manage the end-to-end data processes in the analytics life cycle.”
Who is a Data Scientist?
A Bad Joke from Joel Grus …
Data Science is Interdisciplinary
● Domain data and processes - Domain data is a set of values that share a common meaning or purpose. For example,
a customer database holds the customer name, address, phone number, and email address.
● Algorithms - An algorithm is an exact list of instructions that carries out specified actions step by step.
● Software and Systems Engineering - Covers software creation, from the concept and design through to the coding
of the software, and its maintenance throughout its life cycle.
● Analytical Systems - These are IT systems that process the information outputs produced by middleware. Analytic
systems may comprise databases, data processing software, and Web services.
● Statistics - Statistics is a branch of applied mathematics. It is used to collect and summarize data.
● Machine Learning - ML is the science of developing algorithms and statistical models that computer systems use
to perform complex tasks without explicit instructions.
● Data Mining - Data mining is the process of analyzing a large batch of information to discern trends and patterns.
Video and other resources
● What is statistics
● Machine Learning
What is Statistics?
Descriptive & Inferential Statistics
• Managers may be more interested in high-level insights and strategic recommendations than in detailed technical
explanations.
• Clients may be more interested in how your analysis can help them solve specific problems or achieve specific goals.
They may also be more interested in visualizations and interactive tools that allow them to explore the data
themselves.
• By defining your target audience, you can also anticipate potential objections or questions they may have and
address them proactively in your analysis and presentation. This can help build credibility and trust with your
audience.
The 5 Phases of Data Analysis and Visualization
The 5 Phases of Data Analysis and Visualization: 1. Get the Data
• Process of collecting and acquiring data for analysis.
• Involves identifying and gathering relevant data from various sources, such
as databases, APIs, web scraping, surveys, or experiments.
• It also involves documenting the data sources.
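As a minimal sketch of this phase, the snippet below loads a small CSV sample into a pandas DataFrame and records where it came from. The sample text, the column names, and the `source_info` dictionary are made up for illustration; in practice the data would come from a file, database, API, or survey.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded file.
csv_text = """state,startdate,pct
VIC,2024-01-05,48.5
NSW,2024-01-03,51.0
QLD,2024-01-04,47.2
"""

# Read the CSV text into a DataFrame (a file path could be passed instead).
polls = pd.read_csv(io.StringIO(csv_text))

# Document the data source alongside the data (illustrative fields only).
source_info = {"origin": "in-memory sample", "rows": len(polls)}

print(polls)
print(source_info)
```

Keeping a small record of the source next to the data makes it easier to trace results back to where they came from later in the analysis.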
The 5 Phases of Data Analysis and Visualization: 2. Clean the Data
Focuses on identifying and correcting errors and inconsistencies in the data:
• Remove unnecessary rows and columns: This involves eliminating any rows or columns in the
dataset that are not relevant or useful for the analysis, which can improve processing speed and reduce
noise in the data.
• Handle invalid or missing values: This involves identifying and addressing missing or invalid data
values in the dataset, which can impact the accuracy and reliability of the analysis. Common approaches
to handling missing or invalid values include deletion, substitution, or imputation (estimating the
values by preserving the statistical relationships).
• Change object data types to datetime or numeric data types: Converting data stored in object (text)
format into datetime or numeric formats, such as integers, floats, or timestamps, is necessary for many
analysis techniques that require specific data formats.
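The three cleaning steps above could look roughly like this in pandas. The `raw` DataFrame and its column names are invented for the example, and filling gaps with the column mean is only one simple form of substitution, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical raw data: everything is stored as text, with gaps and a junk column.
raw = pd.DataFrame({
    "state": ["VIC", "NSW", "QLD", "WA"],
    "startdate": ["2024-01-05", "2024-01-03", "not a date", "2024-01-07"],
    "pct": ["48.5", "n/a", "51.0", "47.0"],
    "notes": ["", "", "", ""],
})

# 1. Remove a column that is not needed for the analysis.
clean = raw.drop(columns=["notes"])

# 2. Convert object (text) columns to datetime and numeric types;
#    errors="coerce" turns values that cannot be parsed into NaT/NaN.
clean["startdate"] = pd.to_datetime(clean["startdate"], errors="coerce")
clean["pct"] = pd.to_numeric(clean["pct"], errors="coerce")

# 3. Handle missing values: delete rows without a valid date,
#    then substitute the column mean for any missing percentage.
clean = clean.dropna(subset=["startdate"]).copy()
clean["pct"] = clean["pct"].fillna(clean["pct"].mean())

print(clean.dtypes)
print(clean)
```

Converting the types before handling missing values means unparseable entries such as "n/a" are treated as missing rather than silently kept as text.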
The 5 Phases of Data Analysis and Visualization: 3. Prepare the Data
• Add columns that are derived from other columns: This involves creating new columns in a dataset
based on the values of other columns, using mathematical calculations or data transformations to
extract more meaningful information.
• Shape the data into the forms that are needed for your analysis: This involves structuring the
data in a way that makes it easier to analyze, including filtering, sorting, and grouping the data to
focus on relevant information.
• Make preliminary visualizations to better understand the data: This involves creating graphs,
charts, and other visual representations of the data to explore its patterns and relationships, and
to identify any outliers or anomalies that may need further investigation.
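A short sketch of the preparation steps, assuming a hypothetical `sales` DataFrame with made-up columns. A numeric summary from `describe()` stands in here for a preliminary chart, so the example stays text-only.

```python
import pandas as pd

# Hypothetical cleaned data.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 4, 7, 12],
    "unit_price": [2.5, 3.0, 2.5, 3.0],
})

# Add a column derived from other columns.
sales["revenue"] = sales["units"] * sales["unit_price"]

# Shape the data: keep the larger orders only, then sort by revenue.
large_orders = (
    sales[sales["units"] >= 7]
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)

# A quick numeric summary is a cheap first look before plotting anything.
summary = sales["revenue"].describe()

print(large_orders)
print(summary)
```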
The 5 Phases of Data Analysis and Visualization: 4. Analyze the Data
• Get new views of the data by grouping and aggregating the data: This involves
summarizing the data by creating subsets based on specific variables and then applying
aggregation functions such as sum, average, or count to calculate summary statistics for each
group.
• Make visualizations that provide insights and show relationships: This involves creating
charts, graphs, and other visual representations of the data to identify patterns, trends, and
relationships that may not be apparent in raw data.
• Model the data as part of predictive analysis: This involves building statistical models, such
as regression or decision trees, to predict future outcomes based on historical data and to
identify the key factors that drive those outcomes.
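The grouping and modelling ideas above can be sketched as follows. The `history` data is invented, and the model is a plain least-squares line fitted with `numpy.polyfit`, used here as a lightweight stand-in for a dedicated modelling library.

```python
import numpy as np
import pandas as pd

# Hypothetical historical data.
history = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "ad_spend": [1.0, 2.0, 3.0, 4.0],
    "revenue": [3.0, 5.0, 7.0, 9.0],
})

# Group and aggregate: total, mean, and count of revenue per region.
by_region = history.groupby("region")["revenue"].agg(["sum", "mean", "count"])

# Fit a straight line: revenue = slope * ad_spend + intercept.
slope, intercept = np.polyfit(history["ad_spend"], history["revenue"], deg=1)

# Predict revenue for a new, unseen ad spend.
predicted = slope * 5.0 + intercept

print(by_region)
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```

The fitted slope is the "key factor" in this tiny example: it estimates how much revenue changes for each extra unit of ad spend.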
The 5 Phases of Data Analysis and Visualization: 5. Visualize the Data
• Enhancing visualizations to make them appropriate for the target audience.
• Involves tailoring the presentation of data to effectively communicate insights and key
messages to a specific group of people. This can include modifying the design, layout, and level
of detail of the visualization to match the audience's level of expertise, interests, and
preferences.
• For example, a simple and intuitive visualization may be more suitable for a non-technical audience,
while a more complex and detailed visualization may be more appropriate for a group of data experts.
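One way to picture the difference is to draw the same data twice, once for each audience. This is a rough sketch using matplotlib with invented monthly figures; the titles, labels, and layout are illustrative choices, not a prescribed style.

```python
import io

import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical monthly results.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12.0, 15.0, 14.0, 18.0]
positions = list(range(len(months)))

fig, (simple_ax, detailed_ax) = plt.subplots(1, 2, figsize=(10, 4))

# Non-technical audience: one clear message, minimal decoration.
simple_ax.bar(positions, revenue)
simple_ax.set_xticks(positions)
simple_ax.set_xticklabels(months)
simple_ax.set_title("Revenue is growing")

# Expert audience: exact values, axis labels, and a grid.
detailed_ax.plot(positions, revenue, marker="o")
for x, y in zip(positions, revenue):
    detailed_ax.annotate(f"{y:.1f}", (x, y), textcoords="offset points", xytext=(0, 6))
detailed_ax.set_xticks(positions)
detailed_ax.set_xticklabels(months)
detailed_ax.set_xlabel("Month")
detailed_ax.set_ylabel("Revenue")
detailed_ax.set_title("Monthly revenue with exact values")
detailed_ax.grid(True)

# Render the figure to an in-memory PNG (a file name could be used instead).
fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```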
What’s the difference between data cleaning and data preparation?
Answer …
https://docs.anaconda.com/anaconda/install/windows
https://docs.anaconda.com/anaconda/install/mac-os
https://docs.anaconda.com/anaconda/install/linux
LAUNCHING JUPYTERLAB FROM ANACONDA NAVIGATOR
TUTORIALS
● JupyterLab Tutorial - 1
● JupyterLab Tutorial - 2
Modules Included with Anaconda
Module | Abbreviation | Provides methods for
pandas | pd | Data analysis and visualization
numpy | np | Numerical computing
seaborn | sns | Data visualization
datetime | dt | Working with datetime objects
urllib | | Getting files from the web
zipfile | | Working with zip files
sqlite3 | | Working with a SQLite database
json | | Working with JSON data
sklearn | | Regression analysis
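To give a feel for a few of these modules, here is a small sketch that uses only datetime, json, sqlite3, and zipfile, which are part of Python's standard library. All values, table names, and file names are invented for the example; urllib is left out because it needs a network connection.

```python
import datetime as dt
import io
import json
import sqlite3
import zipfile

# datetime: build a date and format it as text.
start = dt.date(2024, 3, 4)
label = start.strftime("%Y-%m-%d")

# json: convert JSON text into a Python dictionary.
record = json.loads('{"unit": "SIT112", "week": 1}')

# sqlite3: create an in-memory database, insert a row, and query it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weeks (unit TEXT, week INTEGER)")
conn.execute("INSERT INTO weeks VALUES (?, ?)", (record["unit"], record["week"]))
rows = conn.execute("SELECT unit, week FROM weeks").fetchall()
conn.close()

# zipfile: write a small archive into memory and read the file back.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "Lecture Week 1")
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    text = archive.read("notes.txt").decode("utf-8")

print(label, rows, names, text)
```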
Two Ways to Install a Module
Conda-forge is a community-driven channel for Conda packages that provides many additional packages
beyond those available in the default Conda channels.
How to Import Modules
Import one module into the namespace specified by the ‘as’ clause
import pandas as pd
# Sort the rows of the DataFrame polls in ascending order by the values in the
# 'startdate' column and return the first five rows of the resulting DataFrame.
polls.sort_values('startdate').head()
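The `polls` DataFrame itself is not shown on the slide, so the following self-contained sketch builds a tiny hypothetical one to make the statement runnable. The column values are invented.

```python
import pandas as pd

# Hypothetical polls data; the real 'polls' DataFrame is not shown in the slides.
polls = pd.DataFrame({
    "state": ["VIC", "NSW", "QLD"],
    "startdate": pd.to_datetime(["2024-01-05", "2024-01-03", "2024-01-04"]),
})

# sort_values() returns a new, sorted DataFrame; head() keeps the first five rows.
earliest = polls.sort_values("startdate").head()

print(earliest)
```

Because `inplace=True` is not used here, `polls` keeps its original row order and only `earliest` is sorted.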
How to use the Python type() function to check the data type of a variable
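A minimal illustration of `type()`, using a few made-up variables. The `isinstance()` check at the end is an addition for context, since it is the more common way to test a type inside a condition.

```python
# type() returns the class of the object a variable refers to.
count = 42
price = 19.99
name = "SIT112"
flags = [True, False]

print(type(count))   # <class 'int'>
print(type(price))   # <class 'float'>
print(type(name))    # <class 'str'>
print(type(flags))   # <class 'list'>

# isinstance() tests whether a value belongs to one of several types.
is_number = isinstance(price, (int, float))
print(is_number)
```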
Two ways to Continue a Statement
With implicit continuation
polls.sort_values(
    ['state','startdate'],
    ascending=False,
    inplace=True)
With explicit continuation
polls.sort_values(['state','startdate'], \
                  ascending=False, \
                  inplace=True)
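The same two styles can be tried without pandas. In this small sketch (the numbers are arbitrary), both statements compute the same total, which shows that the continuation style affects only layout, not behaviour.

```python
# Implicit continuation: an open bracket lets the expression span several lines.
total_implicit = sum(
    [10, 20, 30]
)

# Explicit continuation: a backslash must be the very last character on the line.
total_explicit = 10 + \
    20 + \
    30

print(total_implicit, total_explicit)
```

Implicit continuation is generally the less fragile of the two, because a stray space after a backslash causes a syntax error.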
Using JupyterLab
Working with the Cells
How to select one or more cells
∙ To select one cell, position the pointer in the left margin of the cell so it becomes a crosshair,
and then click so a blue line is displayed.
∙ To select more than one cell, select the first cell, hold down the Shift key, and select the last
cell.