FIT1043 - Lecture 2 - 2024 Slides


FIT1043 Lecture 2

Introduction to Data Science

Mahsa Salehi

Faculty of Information Technology, Monash University

Semester 2, 2024
Additional resources

To familiarize yourself with the format and types of questions on your final exam:

• Review the sample final exam available on Moodle, under additional resources. While not comprehensive, it provides a good overview.

• Take the weekly quizzes on Moodle under each week, which mainly consist of questions from previous years' final exams.
Weekly pre-class, home activities
• We will have pre-class activities and/or homework activities each week.

• Please check Moodle at the end of each week to be prepared for the following week’s content.
From last week: Our Standard Value Chain
Collection: getting the data
Engineering: storage and computational resources
Governance: overall management of data
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing that results are significant and useful
Operationalisation: putting the results to work

We will refer to this throughout the semester!
Discussion: Data Science Jobs
Data Science Job Market in Australia
► smaller (per capita) market compared to USA & UK, where giant
industry players are making better use of Data Science

Job Advertisements:
► communication skills and domain expertise are rated highly
► different jobs require different toolset skills
► see Adzuna’s CV upload page for an interesting application!
Unit Overview in Our Standard Value Chain
Collection: getting the data
Engineering: storage and computational resources
Governance: overall management of data
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing that results are significant and useful
Operationalisation: putting the results to work

Week 1: Overview of data science
Weeks 2 & 8: Tools for data science
Week 3
Week 4
Weeks 5-7
Weeks 9-10
Week 11
Week 12
Assessments Overview
Assessments:
• Assignment 1 (Weeks 2,3,4)
• Assignment 2 (Weeks 2-7)
• Assignment 3 (Weeks 8,9, 10)
• Final Exam (Weeks 1-12)

Unit Schedule
Week  Activities                                       Assignments
1     Overview of data science
2     Introduction to Python for data science
3     Data visualisation and descriptive statistics
4     Data sources and data wrangling
5     Data analysis theory                             Assignment 1
6     Regression analysis
7     Classification and clustering
8     Introduction to R for data science
9     Characterising data and "big" data               Assignment 2
10    Big data processing
11    Issues in data management
12    Industry guest lecture                           Assignment 3


Outline

§ Introduction to Python for Data Science
§ Motivation for studying Python
§ Python data types
§ Essential libraries
Learning Outcomes (Week 2)

By the end of this week you should be able to:

► Comprehend the importance of Python as a data science tool
► Comprehend essentials for coding in Python for data science
► Explain and interpret given Python code
► Comprehend the concept of a dataframe
► Work with data using data pre-processing commands such as aggregating
Introduction to Python for Data
Science
From Python Data Science Handbook by
J. Vanderplas
The 2023 Top Programming
Languages

image src: IEEE


The 2022 Top Programming
Languages

image src: IEEE

image src: Crossover.com
Data Science Preferred Tools
Google “Data Science Preferred Tools”

► Python’s Role in Data Science
► Many tools out there for data science.
► Python has gained popularity over the last few years.
  ► easy to learn
  ► flexible and multi-purpose
  ► great libraries
  ► well-designed computer language
  ► good visualization for basic analysis

image src: kdnuggets.com
Setting Up Python Environment
► Python 2.x vs 3.x

► IPython vs Jupyter Project


► IPython (Interactive Python) is a useful interactive
interface to Python, and provides a number of useful
syntactic additions to the language
► Jupyter provides a browser-based notebook useful
for development, collaboration and publication of
results.
Anaconda Project
► All the Best Tools in One Platform
► Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 1,500 open source packages. Anaconda is free and easy to install.

A desktop graphical user interface (GUI) to use Anaconda
Poll
What is .ipynb?

A. An illegal file extension.


B. Interactive Python NoteBook.
C. Intelligent Python Nota Bene.
D. Typo, it should be ‘pinyin’
Python Basic Types

► Integers
► Floating-Point Numbers
► Boolean
► True/False
► Strings
Integers (int)

► Python interprets a sequence of decimal (power of 10) digits without any prefix (0b, 0o or 0x) to be a decimal number.

► 0b is interpreted as a binary sequence of numbers
>>> print(0b10)
2
► 0o is interpreted as an octal sequence of numbers (rarely used)
>>> print(0o10)
8
► 0x is interpreted as a hexadecimal sequence of numbers
>>> print(0x10)
16
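As a small sketch of the prefixes above (the variable names are illustrative and not from the slides), the same value can be written in each base, and the built-in functions int, bin, oct and hex convert between the forms:

```python
# The same number written with each prefix
decimal_value = 16
binary_value = 0b10000
octal_value = 0o20
hex_value = 0x10

# All four literals denote the same int
all_equal = decimal_value == binary_value == octal_value == hex_value
print(all_equal)

# int() can parse a string in a given base; bin/oct/hex go the other way
parsed = int("10", 16)  # read "10" as a hexadecimal number
print(parsed, bin(parsed), oct(parsed), hex(parsed))
```

Whatever prefix is used in the source code, Python stores a plain int; the prefix only affects how the literal is read.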
Floating Point (float)
► The values are specified with a decimal point.

>>> 4.2
4.2
>>> type(4.2)
float
>>> 4.
4.0

► For scientific notation style, the character e followed by a positive or negative integer may be used.

>>> .4e7
4000000.0
>>> type(.4e7)
float
>>> 4.2e-4
0.00042

.4 : coefficient
e : 10 to the power of
7 : exponent
.4 × 10^7
Boolean (bool)
► Note that the bool type was added in Python 2.3; in Python 3, True and False are also reserved keywords and cannot be reassigned.
► Boolean type (in any language) has one of two values, True or False

>>> type(True)
bool
>>> type(False)
bool
>>> print(True | False)
True
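A short sketch of how Boolean values are typically produced and combined. The variable names are made up for illustration; the point is that the keywords and, or and not are the logical operators, while | and & are bitwise operators that also happen to work on bool values:

```python
a, b = True, False

# Logical operators (keywords)
print(a and b, a or b, not a)

# Bitwise operators also accept bool operands
print(a | b, a & b)

# Comparisons produce bool values
is_adult = 19 >= 18
print(type(is_adult).__name__, is_adult)

# bool is a subclass of int, so True counts as 1 when summed
flags = [True, False, True]
true_count = sum(flags)
print(true_count)
```

Counting True values with sum is a pattern that comes up often when checking how many rows satisfy a condition.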
Strings (str)

► Strings are delimited using either the single or double quotes.
► Only the characters between the opening delimiter and matching closing delimiter are part of the string.

>>> print("I am a string.")
I am a string.
>>> type("I am a string.")
str
Strings (str)

► Handling strings can be a bit more complicated than we initially think.
► For example, if we want to include quotes.
  ► You aren't simple

>>> print('you aren't simple')
SyntaxError (the exact message depends on the Python version)

>>> print("you aren't simple")
you aren't simple
Strings (str)
► The earlier example is just for the basics of putting the sequence of characters between the delimiters as a string.
► There are many other considerations to cater for special characters in strings handling.
► Use \ (back-slash) as the escape character.

>>> print('you aren\'t simple')
you aren't simple

► There are a few reserved special escape characters:
  ► \t Tab
  ► \n New line
  ► \uxxxx 16-bit unicode character
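The escape characters above can be tried out with a sketch like the following. The sample text is made up, and the raw-string prefix r"..." is an addition that is not covered on the slides:

```python
# Two ways of writing the same string with an apostrophe
quoted_single = 'you aren\'t simple'
quoted_double = "you aren't simple"
same_text = quoted_single == quoted_double
print(same_text)

# \t and \n lay the text out as a small table
table_row = "name\tage\nIan\t19"
print(table_row)

# \uxxxx inserts a character by its 16-bit code point (here: e with acute accent)
unicode_char = "\u00e9"
print(unicode_char, len(unicode_char))

# A raw string keeps back-slashes literally, which helps with Windows paths
raw_path = r"C:\new\table"
print(raw_path)
```

Without the r prefix, the \n and \t inside the path would be read as a new line and a tab.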
Dynamically Typed Language

In statically typed languages such as C, you need to declare a variable and its type before using it:

int x;

In Python, there is no declaration; the type of a variable is only known at run-time.

>>> x = 10
>>> print(type(x))
<class 'int'>
>>> x = ' Hello, world '
>>> print(type(x))
<class 'str'>
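A minimal sketch of what dynamic typing means in practice: the same name can refer to values of different types over time, and the type can be checked at run-time with type or isinstance (first_type and second_type are illustrative names):

```python
x = 10
first_type = type(x)       # the name x currently refers to an int

x = ' Hello, world '
second_type = type(x)      # the same name now refers to a str

print(first_type, second_type)
print(isinstance(x, str))  # run-time type check
```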
Built-in Functions

► There are more than 65 built-in functions in the current Python version. These functions cover
► Maths
► Type Conversions
► Iterators
► Composite Data Types
► Classes, Attributes, and Inheritance
► Input/Output
► Variables, References, and Scope
► Others

► You can refer to them here


Operators and Strings Manipulation
► Arithmetic operators
  +, -, *, /, % etc.
► Comparison operators
  >, <, <=, >=, !=, ==
► String operators
  +, *, in

>>> s = 'foobar'
>>> s[0]
'f'
>>> s[3]
'b'
>>> len(s)
6
>>> s[len(s)-1]
'r'
>>> s[-1]
'r'
Strings (useful for Data Science)
► String subset
>>> s = 'foobar'
>>> s[2:5]
'oba'
>>> s[0:4]
'foob'
>>> s[2:]
'obar'
>>> s[:4] + s[4:]
'foobar'
>>> s[:4] + s[4:] == s
True

► Striding
>>> s = 'foobar'
>>> s[0:6:2]
'foa'
>>> s[1:6:2]
'obr'
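A few more slicing patterns on the same string, as a sketch. Negative indices and a negative stride are not shown on the slide but follow the same start:stop:step rule; the variable names are illustrative:

```python
s = 'foobar'

first_three = s[:3]     # characters at positions 0, 1, 2
last_three = s[-3:]     # the final three characters
every_second = s[::2]   # stride of 2 over the whole string
reversed_s = s[::-1]    # stride of -1 walks the string backwards
print(first_three, last_three, every_second, reversed_s)

# The string operators +, * and in from the operators slide
contains_oba = 'oba' in s
repeated = 'ab' * 3
print(contains_oba, repeated)
```

The same slicing syntax applies to lists, and later to rows of a DataFrame.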
More Python Data Types

Lists and tuples are useful Python data types.

► A Python list is a collection of objects (not necessarily of the same type).
► Lists are defined by square brackets ([]) that enclose a comma-separated sequence of objects.

>>> a = ['foo', 'bar', 'baz', 'qux']
>>> print(a)
['foo', 'bar', 'baz', 'qux']

► Lists are ordered.
► Lists can contain any arbitrary objects.
► List elements can be accessed by index.
► Lists can be nested to arbitrary depth.
► Lists are mutable.
► Lists are dynamic.
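The list properties named on this slide (mutable, dynamic, arbitrary objects, nesting) can be illustrated with a short sketch; the values appended here are made up:

```python
a = ['foo', 'bar', 'baz', 'qux']

a[1] = 'BAR'           # mutable: replace an element in place
a.append(42)           # dynamic: the list grows, and types can be mixed
a.append(['x', 'y'])   # nested: a list inside a list

print(a)
print(a[-1][0])        # index into the nested list
print(len(a))
```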
More Python Data Types

Tuple
► Tuples are identical to lists in all aspects except that the contents are immutable (fixed).
► Tuples are defined by round brackets (parentheses) that enclose a comma-separated sequence of objects ().

Dictionary
► A dictionary is similar to a list in that it is a collection of objects.
► The only difference is that a list is ordered and indexed by position, whereas a dictionary is indexed by key.
► Think of each entry as a key-value pair.
► This maps nicely to Data Science when there is access to NoSQL databases that store items in key-value pairs.
Dictionary
d = dict([
    (<key>, <value>),
    (<key>, <value>),
    .
    .
    .
    (<key>, <value>)
])

>>> person = {}
>>> person['fname'] = 'Ian'
>>> person['lname'] = 'Tan'
>>> person['age'] = 19
>>> person['pets'] = {'dog': 'Barney', 'cat': 'Dino'}
>>> person
{'fname': 'Ian', 'lname': 'Tan', 'age': 19, 'pets': {'dog': 'Barney', 'cat': 'Dino'}}
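As a sketch of the difference between tuples and dictionaries in use: assigning to a tuple element raises an error, while dictionary values are read (and nested values reached) by key. The tuple point and the missing key 'mname' are hypothetical examples:

```python
# Tuples cannot be changed after creation
point = (3, 4)
try:
    point[0] = 10
except TypeError as err:
    tuple_error = str(err)
    print("tuples are immutable:", tuple_error)

# Dictionaries are indexed by key, and values can themselves be dictionaries
person = {'fname': 'Ian', 'lname': 'Tan', 'age': 19,
          'pets': {'dog': 'Barney', 'cat': 'Dino'}}
dog_name = person['pets']['dog']         # nested lookup
middle = person.get('mname', 'unknown')  # default when the key is absent

for key, value in person.items():
    print(key, '->', value)
```

Using get with a default avoids a KeyError when a key may be missing, which is common with semi-structured records.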
Controls

Conditions
if <expr>:
    <statement(s)>
elif <expr>:
    <statement(s)>
elif <expr>:
    <statement(s)>
else:
    <statement(s)>

Iterations
while <expr>:
    <statement(s)>

Python for loops link

Note: Python uses indentation!
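The templates above can be filled in as follows. This is only a sketch: the function describe_age and its age thresholds are hypothetical and chosen for illustration.

```python
def describe_age(age):
    # Indentation defines which statements belong to each branch
    if age < 13:
        return 'child'
    elif age < 20:
        return 'teenager'
    else:
        return 'adult'

# for loop: visit each element of a list in order
ages = [5, 19, 42]
labels = []
for age in ages:
    labels.append(describe_age(age))
print(labels)

# while loop: repeat until the condition becomes False
countdown = 3
while countdown > 0:
    print(countdown)
    countdown -= 1
```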


Essential Python and Data
Science
Specific libraries that are considered as the “starter pack” for
Data Science:
► Numpy: Scientific computing, support for multi-
dimensional arrays
► Pandas: Data structures as well as operations for
manipulating numerical tables.
► Matplotlib: library for visualization
► Scikit-learn: Python machine learning library that provides
the tools for data mining and data analysis

For some, you may also want to look at


► NLTK: Natural Language ToolKit to work with human
language data
Loading Libraries

The general syntax to include a library:

>>> import numpy as np


>>> import pandas as pd
>>> from matplotlib import pyplot as plt
>>> import matplotlib.pyplot as plt
Let’s Start!

► Data Science needs DATA


► Reading data
► Writing data

► We can read data from different sources


► Flat files
► CSV files
► Excel files
► Image files
► Relational databases
► NoSQL databases
► Web
Reading from CSV

► Python has a built-in CSV reader but for Data Science purposes, we will use the pandas library.

► Assuming your file name is filename.csv

>>> import pandas as pd


>>> data = pd.read_csv("filename.csv")
>>> data.head()

>>> X = data[["Age"]]
>>> print(X)
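To try read_csv without a file on disk, a sketch like the one below can be used. It assumes pandas is installed; the rows are made up and only borrow Titanic-style column names, and io.StringIO stands in for filename.csv:

```python
import io
import pandas as pd

# Made-up rows; in practice this text would live in filename.csv
csv_text = """Name,Sex,Age,Survived,Fare
Alice,female,38,1,71.28
Bob,male,22,0,7.25
Carol,female,26,1,7.92
Dave,male,35,0,8.05
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))
print(data.head(2))   # first two rows
print(data.shape)     # (number of rows, number of columns)

X = data[["Age"]]     # double brackets: a one-column DataFrame
ages = data["Age"]    # single brackets: a Series
print(type(X).__name__, type(ages).__name__)
```

The distinction between data[["Age"]] and data["Age"] matters later, because many machine learning functions expect a two-dimensional input.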
Usual 1st Step upon Obtaining
Data
►A description or a summary of it.

► Sometimes, referred to as five number summary if the data


is numeric.
► Minimum, maximum, median, 1st quartile, 3rd quartile

► Work with pandas DataFrames.

>>> df = pd.DataFrame(data)
>>> print(df)

>>> df.describe()
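A sketch of how the five number summary can be read off describe(), assuming pandas is installed. The ages are made up; describe() also reports count, mean and std, so the five values are selected by label:

```python
import pandas as pd

# Made-up ages for illustration
df = pd.DataFrame({'Age': [22, 26, 35, 35, 38]})

summary = df['Age'].describe()

# Minimum, 1st quartile, median, 3rd quartile, maximum
five_number = summary[['min', '25%', '50%', '75%', 'max']]
print(five_number)

# The median can also be computed directly
median_age = df['Age'].median()
print(median_age)
```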
Working with DataFrames (Basic)

► Select a column by using its column name:


>>> df['Name']
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs
Thayer)
► Select multiple columns using a list of column names:
>>> df[['Name', 'Survived']]
Name Survived
0 Braund, Mr. Owen Harris 0
1 Cumings, Mrs. John Bradley (Florence Briggs
Thayer) 1
► Select a value using the column name and row index:
>>> df['Name'][3]
'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
Working with DataFrames (Basic)
► Select a particular row from the table:
>>> df.loc[2]
PassengerId 3
Survived 1
Pclass 3
Name Heikkinen, Miss. Laina
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: 2, dtype: object
Working with DataFrames (Basic)

► Select all rows with a particular value in one of the columns:

>>> df.loc[df['Age'] <= 6]
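The row selections on these slides can be sketched on a small made-up table (assuming pandas is installed; the names and ages are illustrative). The expression inside loc is a Boolean Series with one True/False per row, and loc keeps the rows marked True:

```python
import pandas as pd

# Made-up data with Titanic-style column names
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'Age': [38, 4, 26, 6],
    'Survived': [1, 0, 1, 1],
})

row = df.loc[2]             # one row, selected by its index label
name = df.loc[2, 'Name']    # one value: row label and column name together

mask = df['Age'] <= 6       # Boolean Series, one entry per row
young = df.loc[mask]        # only the rows where the mask is True

print(row)
print(name)
print(mask.tolist())
print(young)
```

Selecting a single value with df.loc[row, column] does the same job as df[column][row] in one step.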

Save the Data

► Assuming you just want to analyse a part of the data and you want to save the resulting data frame to a CSV file.

>>> df2 = df.loc[df['Age'] >= 12]
>>> df2.to_csv('output.csv', index=False, header=True)

► We have now read, described, explored and saved the data.
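A round-trip sketch of saving and re-reading a filtered data frame, assuming pandas is installed. The data is made up, and a temporary directory is used here only so that the example leaves no file behind:

```python
import os
import tempfile
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'Age': [38, 4, 26, 12],
})

df2 = df.loc[df['Age'] >= 12]   # keep passengers aged 12 or over

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'output.csv')
    # index=False leaves the row labels out of the file
    df2.to_csv(out_path, index=False, header=True)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

If the index were written to the file, reading it back would produce an extra unnamed column.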
Working With Data

► Some basic data pre-processing steps are usually done, or at least taken into consideration.
► Categorical data
► Subsetting data
► Slicing
► Aggregating

► More will be explored in coming weeks


► Removing duplicates
► Dealing with dates
► Missing data
► Concatenating
► Transforming
Categorical Data

► Categorical data takes its value from a limited set of values. The options are fixed.

► A ticket class is generally categorical, i.e. 1st class, 2nd class & 3rd class.

>>> df.loc[df['Pclass'] == 1]

► We can create our own categories, e.g.

>>> import pandas as pd


>>> tix_class = pd.Series(['1st','2nd','3rd'],
dtype='category')
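A sketch of what the category dtype provides, assuming pandas is installed. The ticket values are made up; .cat.categories lists the fixed set of options, and astype('category') converts an existing column:

```python
import pandas as pd

# Made-up ticket classes for six passengers
tix_class = pd.Series(['1st', '3rd', '2nd', '3rd', '1st', '3rd'],
                      dtype='category')

categories = tix_class.cat.categories.tolist()  # the fixed set of options
counts = tix_class.value_counts()               # passengers per class
print(categories)
print(counts)

# An existing numeric column can be converted as well
df = pd.DataFrame({'Pclass': [1, 3, 2, 3]})
df['Pclass'] = df['Pclass'].astype('category')
print(df['Pclass'].dtype)
```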
Subsetting Data

► We have actually already done this a few slides before :)

► Extract only those that survived

>>> df.loc[df['Survived'] == 1]

► What does the code below return?


>>> df.loc[(df['Sex'] == 'female') &
(df['Survived'] == 1)]
Slicing Data

► Slice rows by row index.

>>> df[:5]
>>> df[3:10]

► If we only want certain columns, e.g. Age, Name, Sex, Survived

>>> df.loc[:, ['Age', 'Name', 'Sex', 'Survived']]
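One detail worth checking when slicing, sketched below on made-up data (assuming pandas is installed): df[1:3] slices by position and excludes the end, whereas df.loc[1:3] slices by label and includes the end.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'Sex': ['female', 'male', 'female', 'male', 'female'],
    'Age': [38, 22, 26, 35, 29],
})

first_two = df[:2]         # rows at positions 0 and 1
by_position = df[1:3]      # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]     # labels 1, 2 and 3 (end included)

# All rows, selected columns in the order given
subset = df.loc[:, ['Age', 'Name']]

print(by_position)
print(by_label)
print(subset)
```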
Aggregating

► Like our five number summary, we can also obtain aggregated values for columns. The total fare can be easily obtained by

>>> df['Fare'].sum()
4385.095600000001

► Or we can get the average age of the passenger by

>>> df['Age'].mean()
28.141507936507935

► Check the answers against the df.describe() earlier
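The check against describe() can be sketched like this, assuming pandas is installed. The fares and ages are made up, and agg is used here as one way to request several aggregates at once:

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    'Fare': [71.28, 7.25, 7.92, 8.05],
    'Age': [38, 22, 26, 35],
})

total_fare = df['Fare'].sum()
mean_age = df['Age'].mean()
print(total_fare, mean_age)

# Several aggregates in one call
age_stats = df['Age'].agg(['min', 'max', 'mean'])
print(age_stats)

# The same mean appears in the describe() table
described_mean = df.describe().loc['mean', 'Age']
print(described_mean)
```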


Aggregating

► Like in SQL, we often want to know the aggregated values


for certain values from another column. Similarly, we can use
the groupby function:

>>> df.groupby('Sex')['Age'].mean()
Sex
female 24.468085
male 30.326962
Name: Age, dtype: float64
Aggregation and groupby

Input (Gender, Age):
  female 38
  male 22
  female 26
  female 35
  male 35

Split (one group per Gender):
  female: 38, 26, 35
  male: 22, 35

Apply (mean):
  female: 33
  male: 28.5

Combine (average Age per Gender):
  female 33
  male 28.5
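The split, apply and combine steps can be sketched on the five rows from this slide (assuming pandas is installed). The manual loop is only there to make the three steps visible; groupby performs them in one call:

```python
import pandas as pd

# The five input rows from the slide
people = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'female', 'male'],
    'Age': [38, 22, 26, 35, 35],
})

# Manual version: split by gender, apply mean, combine into a dict
manual = {}
for gender in people['Gender'].unique():
    group = people.loc[people['Gender'] == gender]   # split
    manual[gender] = group['Age'].mean()             # apply
print(manual)                                        # combine

# groupby does the same three steps
mean_age = people.groupby('Gender')['Age'].mean()
print(mean_age)
```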
Aggregating

► What does the following mean?

>>> df.loc[df['Survived']==1].groupby('Sex')['Age'].mean()
Sex
female 26.265625
male 23.314444
Name: Age, dtype: float64

► Compare it with the previous statement, what can you tell


from it?
Poll
What is a dataframe?

A. An array.
B. A list.
C. A theory about data.
D. A structure that stores tabular data
Learning Outcomes (Recap)

This week we learnt the following:

► Comprehend the importance of Python as a data science tool
► Comprehend essentials for coding in Python for data science
► Explain and interpret given Python code
► Comprehend the concept of a dataframe
► Work with data using data pre-processing commands such as aggregating
Next few weeks

► We will be using Python for the next few weeks


► MatPlotLib
► Scikit-Learn
Applied Session- Week 2

§ Introductory Python for data science


§ Make sure you participate in the applied session activities; they are very important for Assignments 1 & 2
§ Use the forum if you wish to swap your classes
