FIT1043 - Lecture 2 - 2024 Slides

FIT1043 Lecture 2

Introduction to Data Science

Mahsa Salehi

Faculty of Information Technology, Monash University

Semester 2, 2024
Additional resources

To familiarize yourself with the format and types of questions on your final exam:

• Review the sample final exam available on Moodle, under additional resources. While not comprehensive, it provides a good overview.

• Take the weekly quizzes on Moodle under each week, which mainly consist of questions from previous years' final exams.
Weekly pre-class, home activities
• We will have pre-class activities and/or homework activities each week.

• Please check Moodle at the end of each week to be prepared for the following week’s content.
From last week: Our Standard Value Chain
Collection: getting the data
Engineering: storage and computational resources
Governance: overall management of data
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing that results are significant and useful
Operationalisation: putting the results to work

We will refer to this throughout the semester!
Discussion: Data Science Jobs
Data Science Job Market in Australia
► smaller (per capita) market compared to USA & UK, where giant
industry players are making better use of Data Science

Job Advertisements:
► communication skills and domain expertise are rated highly
► different jobs require different toolset skills
► see Adzuna’s CV upload page for an interesting application!
Unit Overview in Our Standard Value Chain
Collection: getting the data
Engineering: storage and computational resources
Governance: overall management of data
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing that results are significant and useful
Operationalisation: putting the results to work

[Figure: the unit's weeks mapped onto the value chain. Recoverable labels: Week 1 (overview of data science); Weeks 2 & 8 (tools for data science); Week 3; Week 4; Weeks 5-7; Weeks 9-10; Week 11; Week 12.]
Assessments Overview
Assessments:
• Assignment 1 (Weeks 2,3,4)
• Assignment 2 (Weeks 2-7)
• Assignment 3 (Weeks 8,9, 10)
• Final Exam (Weeks 1-12)

Unit Schedule
Week  Activities                                      Assignments
1     Overview of data science
2     Introduction to Python for data science
3     Data visualisation and descriptive statistics
4     Data sources and data wrangling
5     Data analysis theory                            Assignment 1
6     Regression analysis
7     Classification and clustering
8     Introduction to R for data science
9     Characterising data and "big" data              Assignment 2
10    Big data processing
11    Issues in data management
12    Industry guest lecture                          Assignment 3

Outline

§ Introduction to Python for Data Science

§ Motivation for studying Python
§ Python data types
§ Essential libraries
Learning Outcomes (Week 2)

By the end of this week you should be able to:

► Comprehend the importance of Python as a data science tool
► Comprehend essentials for coding in Python for data science
► Explain and interpret given Python code
► Comprehend the concept of a dataframe
► Work with data using data pre-processing commands such as aggregating
Introduction to Python for Data Science
From Python Data Science Handbook by J. VanderPlas
The 2023 Top Programming Languages
[Figure: programming language ranking. Image source: IEEE]

The 2022 Top Programming Languages
[Figure: programming language ranking. Image source: IEEE]

[Figure: image source: Crossover.com]
Data Science Preferred Tools
► Python’s Role in Data Science

► Many tools out there for data science.

► Python has gained popularity over the last few years.
►easy to learn
►flexible and multi-purpose
►great libraries
►well-designed computer language
►good visualization for basic analysis
image src: kdnuggets.com
► Google “Data Science Preferred Tools” to explore further.
Setting Up Python Environment
► Python 2.x vs 3.x

► IPython vs Jupyter Project

► IPython (Interactive Python) is a useful interactive
interface to Python, and provides a number of useful
syntactic additions to the language
► Jupyter provides a browser-based notebook useful
for development, collaboration and publication of
results.
Anaconda Project
► All the Best Tools in One Platform
► Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 1,500 open source packages. Anaconda is free and easy to install.

[Figure: a desktop graphical user interface (GUI) to use Anaconda]
Poll
What is .ipynb?

A. An illegal file extension.

B. Interactive Python NoteBook.
C. Intelligent Python Nota Bene.
D. Typo, it should be ‘pinyin’
Python Basic Types

► Integers
► Floating-Point Numbers
► Boolean
► True/False
► Strings
Integers (int)

► Python interprets a sequence of decimal (base 10) digits without any prefix (0b, 0o or 0x) to be a decimal number.

► 0b is interpreted as a binary sequence of numbers

>>> print(0b10)
2
► 0o is interpreted as an octal sequence of numbers (rarely
used)
>>> print(0o10)
8
► 0x is interpreted as a hexadecimal sequence of numbers
>>> print(0x10)
16
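As a small sketch of the prefixes above: the same value can be written in any of the four bases, and the built-in functions `bin`, `oct`, `hex` and `int(text, base)` convert between the representations. The variable names are illustrative only.

```python
# The same integer written with different literal prefixes.
decimal_value = 16
binary_value = 0b10000   # binary literal
octal_value = 0o20       # octal literal
hex_value = 0x10         # hexadecimal literal

# All four literals denote the same int object value.
all_equal = decimal_value == binary_value == octal_value == hex_value
print(all_equal)          # True

# Built-in conversions from an int to a prefixed string.
print(bin(16))            # 0b10000
print(oct(16))            # 0o20
print(hex(16))            # 0x10

# int(text, base) parses a string of digits in the given base.
print(int("10", 2))       # 2
print(int("10", 8))       # 8
print(int("10", 16))      # 16
```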
Floating Point (float)
► The values are specified with a decimal point.
>>> 4.2
4.2
>>> type(4.2)
<class 'float'>
>>> 4.
4.0

► For scientific notation style, the character e followed by a positive or negative integer may be used.
>>> .4e7
4000000.0
>>> type(.4e7)
<class 'float'>
>>> 4.2e-4
0.00042

.4 : coefficient
e : 10 to the power of
7 : exponent
.4e7 means .4 × 10^7
Boolean (bool)
► The bool type is built in to both Python 2 (since version 2.3) and Python 3; in Python 3, True and False are also reserved keywords.
► Boolean type (in any language) has one of two values, True or False
>>> type(True)
<class 'bool'>
>>> type(False)
<class 'bool'>
>>> print(True | False)
True
Strings (str)

► Strings are delimited using either single or double quotes.
► Only the characters between the opening delimiter and matching closing delimiter are part of the string.

>>> print("I am a string.")
I am a string.
>>> type("I am a string.")
<class 'str'>
Strings (str)

► Handling strings can be a bit more complicated than we initially think.
► For example, if we want to include quotes.
► You aren't simple

>>> print('you aren't simple')
SyntaxError (the exact message depends on the Python version)

>>> print("you aren't simple")
you aren't simple
Strings (str)
► The earlier example is just for the basics of putting the sequence of characters between the delimiters as a string.
► There are many other considerations to cater for special characters in string handling.
► Use \ (back-slash) as the escape character.

>>> print('you aren\'t simple')
you aren't simple

► There are a few reserved special escape characters:
►\t Tab
►\n New line
►\uxxxx 16-bit unicode character
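The escape sequences listed above can be tried out as follows. This is a minimal sketch with made-up strings; the raw-string form `r'...'` at the end is an addition not mentioned on the slide, shown only for contrast.

```python
# Escape sequences inside ordinary string literals.
quoted = 'you aren\'t simple'   # \' keeps the quote inside the string
tabbed = "name\tage"            # \t is a single tab character
two_lines = "first\nsecond"     # \n starts a new line
omega = "\u03a9"                # \uxxxx: the character with code point U+03A9

print(quoted)
print(tabbed)
print(two_lines)
print(omega)

# An escape sequence counts as one character.
escaped_length = len("a\tb")
print(escaped_length)           # 3

# A raw string leaves the backslash as an ordinary character.
raw_length = len(r"a\tb")
print(raw_length)               # 4
```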
Dynamically Typed Language

In statically typed languages such as C, variables must be declared with a type before they are used, e.g.

int x;

In Python, there is no declaration; the type of a variable is only known at run-time.

>>> x = 10
>>> print(type(x))
<class 'int'>

>>> x = ' Hello, world '

>>> print(type(x))
<class 'str'>
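One consequence of run-time typing, sketched below with illustrative names: a type mismatch is only reported when the offending line actually executes, and an explicit conversion is needed to resolve it.

```python
# The same name can refer to values of different types over time.
value = 10
first_type = type(value).__name__
value = "ten"
second_type = type(value).__name__
print(first_type, second_type)      # int str

# A type mismatch is only detected when the line runs.
try:
    result = "5" + 5                # str + int is not defined
except TypeError as error:
    result = None
    print("TypeError:", error)

# Converting explicitly resolves the mismatch.
total = int("5") + 5
print(total)                        # 10
```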
Built-in Functions

► There are more than 65 built-in functions in the current Python version. These functions cover
► Maths
► Type Conversions
► Iterators
► Composite Data Types
► Classes, Attributes, and Inheritance
► Input/Output
► Variables, References, and Scope
► Others

► You can refer to them here

Operators and Strings Manipulation
► Arithmetic operators
+, -, *, /, % etc.
► Comparison operators
>, <, <=, >=, !=, ==
► String operators
+, *, in

>>> s = 'foobar'
>>> s[0]
'f'
>>> s[3]
'b'
>>> len(s)
6
>>> s[len(s)-1]
'r'
>>> s[-1]
'r'
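A short sketch of the three operator groups listed above, using made-up values. The operators `//` and `**` are not named on the slide and are included here as examples of the "etc."

```python
# Arithmetic operators
true_division = 7 / 2        # 3.5 (always a float)
floor_division = 7 // 2      # 3
remainder = 7 % 2            # 1
power = 2 ** 3               # 8

# Comparison operators always give a bool
is_smaller_or_equal = 3 <= 4     # True
is_different = 3 != 3            # False

# String operators
joined = 'foo' + 'bar'           # concatenation
repeated = 'ab' * 3              # repetition
contains = 'oba' in joined       # substring test

print(true_division, floor_division, remainder, power)
print(is_smaller_or_equal, is_different)
print(joined, repeated, contains)
```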
Strings (useful for Data Science)
► String subset
>>> s = 'foobar'
>>> s[2:5]
'oba'
>>> s[0:4]
'foob'
>>> s[2:]
'obar'
>>> s[:4] + s[4:]
'foobar'
>>> s[:4] + s[4:] == s
True

► Striding
>>> s = 'foobar'
>>> s[0:6:2]
'foa'
>>> s[1:6:2]
'obr'
More Python Data Types

Lists and tuples are useful Python data types.

► A Python list is a collection of objects (not necessarily of the same type).
► Lists are defined by square brackets ([]) that enclose a comma-separated sequence of objects.

>>> a = ['foo', 'bar', 'baz', 'qux']
>>> print(a)
['foo', 'bar', 'baz', 'qux']

► Lists are ordered.
► Lists can contain any arbitrary objects.
► List elements can be accessed by index.
► Lists can be nested to arbitrary depth.
► Lists are mutable.
► Lists are dynamic.
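The list properties above can each be demonstrated in a line or two. This is a minimal sketch; the list contents are arbitrary.

```python
items = ['foo', 'bar', 'baz', 'qux']

# Ordered and accessible by index (negative indices count from the end).
first, last = items[0], items[-1]
print(first, last)                  # foo qux

# Arbitrary objects, nested to any depth.
mixed = [1, 'two', 3.0, [4, 5]]
nested_value = mixed[3][1]
print(nested_value)                 # 5

# Mutable: an element can be replaced in place.
items[1] = 'BAR'

# Dynamic: the list can grow and shrink.
items.append('quux')
del items[0]
print(items)                        # ['BAR', 'baz', 'qux', 'quux']
```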
More Python Data Types

Tuple
► Tuples are identical to lists in all aspects except that the contents are immutable (fixed).
► Tuples are defined by round brackets (parentheses) that enclose a comma-separated sequence of objects: ().

Dictionary
► A dictionary is similar to a list in that it is a collection of objects.
► The main difference is that a list is ordered and indexed by position, whereas a dictionary is indexed by key.
► Think of it as a collection of key-value pairs.
► This maps nicely to Data Science when there is access to NoSQL databases that store items in key-value pairs.
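To make the contrast concrete, here is a small sketch showing that a tuple rejects item assignment and that a dictionary is looked up by key rather than by position. The names and fare values are invented for illustration.

```python
# Tuples are immutable: item assignment raises TypeError.
point = (3, 4)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)                         # False

# A list is indexed by position, a dictionary by key.
fares_list = [7.25, 71.28]
fares_by_name = {"Braund": 7.25, "Cumings": 71.28}
print(fares_list[1])                         # 71.28
print(fares_by_name["Cumings"])              # 71.28

# Membership tests and safe lookups work on keys.
has_key = "Heikkinen" in fares_by_name
default_fare = fares_by_name.get("Heikkinen", 0.0)
print(has_key, default_fare)                 # False 0.0
```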
Dictionary
d = dict([
    (<key>, <value>),
    (<key>, <value>),
    .
    .
    .
    (<key>, <value>)
])

>>> person = {}
>>> person['fname'] = 'Ian'
>>> person['lname'] = 'Tan'
>>> person['age'] = 19
>>> person['pets'] = {'dog': 'Barney', 'cat': 'Dino'}
>>> person
{'fname': 'Ian', 'lname': 'Tan', 'age': 19, 'pets': {'dog': 'Barney', 'cat': 'Dino'}}
Controls

Conditions
if <expr>:
    <statement>
elif <expr>:
    <statement(s)>
elif <expr>:
    <statement(s)>
else:
    <statement(s)>

Iterations
while <expr>:
    <statement(s)>

Python for loops link

Note: Python uses indentation!
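The templates above can be filled in as follows. This is an illustrative sketch with made-up ages and thresholds; note that the indentation alone decides which statements belong to each branch or loop body.

```python
ages = [22, 38, 4, 15]
labels = []

# A for loop visits each element; if/elif/else picks exactly one branch.
for age in ages:
    if age < 13:
        labels.append('child')
    elif age < 20:
        labels.append('teenager')
    else:
        labels.append('adult')

print(labels)        # ['adult', 'adult', 'child', 'teenager']

# A while loop repeats as long as its condition holds.
n = 40
halvings = 0
while n > 1:
    n = n // 2       # 40 -> 20 -> 10 -> 5 -> 2 -> 1
    halvings += 1

print(halvings)      # 5
```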

Essential Python and Data Science
Specific libraries that are considered as the “starter pack” for Data Science:
► NumPy: Scientific computing, support for multi-dimensional arrays
► Pandas: Data structures as well as operations for manipulating numerical tables.
► Matplotlib: library for visualization
► Scikit-learn: Python machine learning library that provides the tools for data mining and data analysis

For some, you may also want to look at

► NLTK: Natural Language ToolKit to work with human language data
Loading Libraries

The general syntax to include a library:

>>> import numpy as np

>>> import pandas as pd
>>> from matplotlib import pyplot as plt
>>> import matplotlib.pyplot as plt
Let’s Start!

► Data Science needs DATA

► Reading data
► Writing data

► We can read data from different sources

► Flat files
► CSV files
► Excel files
► Image files
► Relational databases
► NoSQL databases
► Web
Reading from CSV

► Python has a built-in CSV reader, but for Data Science purposes we will use the pandas library.

► Assuming your file name is filename.csv

>>> import pandas as pd

>>> data = pd.read_csv("filename.csv")
>>> data.head()

>>> X = data[["Age"]]
>>> print(X)
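The same calls can be tried without a file on disk by reading CSV text from memory. This is a sketch under assumptions: the three rows below are a made-up sample loosely modelled on the passenger table used later, and `io.StringIO` stands in for `filename.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for filename.csv.
csv_text = """Name,Sex,Age,Survived
Braund,male,22,0
Cumings,female,38,1
Heikkinen,female,26,1
"""

# read_csv accepts a path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())

# Double brackets select a one-column DataFrame,
# single brackets select a Series.
X = data[["Age"]]
ages = data["Age"]
print(type(X).__name__, type(ages).__name__)   # DataFrame Series
print(data.shape)                              # (3, 4)
```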
Usual 1st Step upon Obtaining
Data
► A description or a summary of it.

► Sometimes referred to as the five number summary if the data is numeric.
► Minimum, maximum, median, 1st quartile, 3rd quartile

► Work with pandas DataFrames.

>>> df = pd.DataFrame(data)
>>> print(df)

>>> df.describe()
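As a sketch of how the five number summary appears in the output of `describe()`, the example below uses a tiny made-up Age column; the quartiles are reported under the labels `25%`, `50%` (the median) and `75%`.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
df = pd.DataFrame({"Age": [22, 38, 26, 35, 35]})

# describe() returns count, mean, std and the five number summary.
summary = df["Age"].describe()
print(summary)

# Pick out just the five number summary.
five_numbers = summary[["min", "25%", "50%", "75%", "max"]]
print(five_numbers.tolist())    # [22.0, 26.0, 35.0, 35.0, 38.0]
```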
Working with DataFrames (Basic)

► Select a column by using its column name:

>>> df['Name']
0                                Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Thayer)
► Select multiple columns using a list of column names:
>>> df[['Name', 'Survived']]
                                                  Name  Survived
0                              Braund, Mr. Owen Harris         0
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)         1
► Select a value using the column name and row index:
>>> df['Name'][3]
'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
Working with DataFrames (Basic)
► Select a particular row from the table:
>>> df.loc[2]
PassengerId 3
Survived 1
Pclass 3
Name Heikkinen, Miss. Laina
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: 2, dtype: object
Working with DataFrames (Basic)

► Select all rows with a particular value in one of the columns:

>>> df.loc[df['Age'] <= 6]
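The expression inside `df.loc[...]` is itself a column of True/False values, one per row, and only the rows marked True are kept. A minimal sketch with invented rows, including one missing age to show that comparisons against a missing value evaluate to False:

```python
import pandas as pd

# Hypothetical rows; None becomes NaN (a missing value) in the Age column.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22, 4, None, 2],
})

# The comparison yields a boolean Series (a "mask").
mask = df["Age"] <= 6
print(mask.tolist())            # [False, True, False, True]

# loc keeps only the rows where the mask is True.
young = df.loc[mask]
print(young["Name"].tolist())   # ['B', 'D']
```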

Save the Data

► Suppose you only want to analyse part of the data and save the resulting DataFrame to a CSV file.

>>> df2 = df.loc[df['Age'] >= 12]

>>> df2.to_csv('output.csv', index=False, header=True)

► We have now read the data, described it, done some basic exploration and saved the result.
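A quick way to check what gets written is to save to an in-memory buffer and read it back. This sketch uses made-up rows and `io.StringIO` in place of `output.csv`; passing `index=False` leaves the row index out of the file.

```python
import io

import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [4, 12, 30]})

# Keep only the rows of interest.
df2 = df.loc[df["Age"] >= 12]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df2.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same two rows.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.shape)           # (2, 2)
```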
Working With Data

► Some basic data pre-processing steps are usually done, or at least taken into consideration.
► Categorical data
► Subsetting data
► Slicing
► Aggregating

► More will be explored in coming weeks

► Removing duplicates
► Dealing with dates
► Missing data
► Concatenating
► Transforming
Categorical Data

► A categorical variable takes its value from a limited, fixed set of options.

► A ticket class is generally categorical, i.e. 1st class, 2nd class & 3rd class.

>>> df.loc[df['Pclass'] == 1]

► We can create our own categories, e.g.

>>> import pandas as pd

>>> tix_class = pd.Series(['1st','2nd','3rd'],
dtype='category')
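Once a Series has the `category` dtype, pandas keeps the set of allowed values separately from the data. The sketch below, using a made-up list of ticket classes, inspects that set through the `.cat` accessor and counts how often each category occurs.

```python
import pandas as pd

# Hypothetical ticket classes for six passengers.
tix_class = pd.Series(['1st', '3rd', '2nd', '3rd', '1st', '3rd'],
                      dtype='category')

# The distinct categories are stored once, in sorted order.
categories = tix_class.cat.categories.tolist()
print(categories)               # ['1st', '2nd', '3rd']

# Internally each value is an integer code pointing at a category.
codes = tix_class.cat.codes.tolist()
print(codes)                    # [0, 2, 1, 2, 0, 2]

# Count how many times each category occurs.
counts = tix_class.value_counts().to_dict()
print(counts)
```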
Subsetting Data

► We have actually already done this a few slides earlier.

► Extract only those that survived

>>> df.loc[df['Survived'] == 1]

► What does the code below return?

>>> df.loc[(df['Sex'] == 'female') &
(df['Survived'] == 1)]
Slicing Data

► Slice rows by row index.

>>> df[:5]
>>> df[3:10]

► If we only want certain columns, e.g. Age, Name, Sex, Survived

>>> df.loc[:, ['Age', 'Name', 'Sex', 'Survived']]
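The slicing forms above differ in one detail worth trying out: plain `df[a:b]` and `iloc` slice by position and exclude the end, while `loc` slices by label and includes it. A sketch with invented rows; `iloc` is not used on the slide and is shown only for comparison.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22, 38, 26, 35, 35],
    "Survived": [0, 1, 1, 1, 0],
})

first_two = df[:2]                          # positions 0 and 1
middle = df[1:3]                            # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]                      # labels 1, 2 and 3 (end included)
some_columns = df.loc[:, ["Age", "Name"]]   # all rows, two named columns
by_position = df.iloc[1:3, 0:2]             # rows 1-2, first two columns

print(first_two.shape)       # (2, 4)
print(middle.shape)          # (2, 4)
print(by_label.shape)        # (3, 4)
print(some_columns.shape)    # (5, 2)
print(by_position.shape)     # (2, 2)
```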
Aggregating

► Like our five number summary, we can also obtain aggregated values for columns. The total fare can be easily obtained by

>>> df['Fare'].sum()
4385.095600000001

► Or we can get the average age of the passengers by

>>> df['Age'].mean()
28.141507936507935

► Check the answers against the df.describe() earlier

Aggregating

► Like in SQL, we often want to know the aggregated values

for certain values from another column. Similarly, we can use
the groupby function:

>>> df.groupby('Sex')['Age'].mean()
Sex
female 24.468085
male 30.326962
Name: Age, dtype: float64
Aggregation and groupby

Input
Gender  Age
female  38
male    22
female  26
female  35
male    35

Split (by Gender)
Gender  Age          Gender  Age
female  38           male    22
female  26           male    35
female  35

Apply (mean)
Gender  Age          Gender  Age
female  33           male    28.5

Combine
Class   Average Age
female  33
male    28.5
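The split, apply and combine steps in the figure can be written out by hand in plain Python, which may help when reading the `groupby` one-liner. This is a sketch of the idea using the five rows from the figure, not a description of how pandas implements it.

```python
from collections import defaultdict
from statistics import mean

# Input rows from the figure: (gender, age).
rows = [("female", 38), ("male", 22), ("female", 26),
        ("female", 35), ("male", 35)]

# Split: collect the ages of each gender into its own list.
groups = defaultdict(list)
for gender, age in rows:
    groups[gender].append(age)

# Apply: compute the mean of each group separately.
averages = {gender: mean(ages) for gender, ages in groups.items()}

# Combine: report one row per group.
for gender in sorted(averages):
    print(gender, averages[gender])
```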
Aggregating

► What does the following mean?

>>> df.loc[df['Survived']==1].groupby('Sex')['Age'].mean()
Sex
female 26.265625
male 23.314444
Name: Age, dtype: float64

► Compare it with the previous statement. What can you tell from it?
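To experiment with this pattern on a small scale, the sketch below runs the same two statements on five invented rows (the Survived values are made up, so the numbers differ from the slide). Filtering first means the mean is taken over the remaining rows only.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22, 38, 26, 35, 35],
    "Survived": [0, 1, 1, 0, 1],
})

# Mean age per sex over all rows.
overall = df.groupby("Sex")["Age"].mean()

# Mean age per sex over the rows with Survived == 1 only.
survivors = df.loc[df["Survived"] == 1].groupby("Sex")["Age"].mean()

print(overall)
print(survivors)
```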
Poll
What is a dataframe?

A. An array.
B. A list.
C. A theory about data.
D. A structure that stores tabular data
Learning Outcomes (Recap)

This week we learnt the following:

► The importance of Python as a data science tool
► Essentials for coding in Python for data science
► How to explain and interpret given Python code
► The concept of a dataframe
► How to work with data using data pre-processing commands such as aggregating
Next few weeks

► We will be using Python for the next few weeks

► MatPlotLib
► Scikit-Learn
Applied Session- Week 2

§ Introductory Python for data science

§ Make sure you participate in the applied session activities; they are very important for Assignments 1 & 2
§ Use the forum if you wish to swap your classes
