S08 Slides
Seminar 8:
Data Preparation
Learning Objectives
In this lecture, you will learn about:
• Dataframe
• Data wrangling
• Data cleaning
3/8/22
Data
• Data are created from various sources.
• They may be collected automatically or manually.
• Data may be:
– Redundant
– Inconsistent
– Inaccurate
Data Management
Cycle around Data: Create → Collect → Process / Transform (we are here) → Analyze → Derive Decision → Preserve
NumPy module
• NumPy stands for Numerical Python.
• Essential package for data computation.
• Introduces NumPy arrays, which are compact and allow
faster read and write operations.
• Mainly used for data manipulation.
• It is the foundational library on which SciPy,
scikit-learn and others are built.
• Must-know library for data science.
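A minimal sketch of what a NumPy array offers, assuming a made-up list of heights in centimetres (the variable names are illustrative only). It shows that arithmetic applies to the whole array at once, without an explicit loop:

```python
import numpy as np

# Hypothetical sample: heights in centimetres.
heights_cm = np.array([150.0, 162.5, 171.0, 180.5])

# Vectorized arithmetic: every element is converted in one expression.
heights_m = heights_cm / 100

print(heights_m.dtype)   # single element type shared by the whole array
print(heights_m.mean())  # average height in metres
```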
Pandas module
• A powerful and flexible open-source library for data
analysis.
• Provides a rich set of data structures for working with
structured data.
• The primary object that we will be using is the DataFrame
object.
• A DataFrame is a two-dimensional, table-like structure
organised into columns with headers, where each row number
corresponds to one record.
• The other data structure in pandas is the Series, for
one-dimensional data.
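The two structures can be sketched as follows, assuming a small made-up table (the column names and values are hypothetical). Selecting a single column of a DataFrame gives back a Series:

```python
import pandas as pd

# Hypothetical records; column names are illustrative only.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cho"],
    "age": [23, 35, 41],
})

# Selecting one column returns a one-dimensional Series.
ages = df["age"]

print(df)                   # table with headers and row numbers 0, 1, 2
print(type(ages).__name__)  # name of the column's type
```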
SciPy module
• For additional data processing.
• High-level data visualization, numerical
processing, and optimization.
• SciPy contains several useful subpackages:
– cluster
– linalg (linear algebra)
– stats
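As a rough illustration of two of these subpackages, the sketch below solves a small linear system with scipy.linalg and summarises a sample with scipy.stats. The equations and the sample values are made up for the example:

```python
import numpy as np
from scipy import linalg, stats

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = linalg.solve(A, b)

# Statistics: summarise a hypothetical sample.
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])
summary = stats.describe(sample)

print(solution)                    # values of x and y
print(summary.nobs, summary.mean)  # sample size and mean
```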
Matplotlib module
• The most popular Python library for visualization.
• It is widely adopted because it supports different
operating systems and output types.
• Supports interactive plotting and data exploration.
• pyplot, in the matplotlib package, is the main interface for
plotting.
• Other modules build on matplotlib to produce more
powerful visualizations:
– seaborn
– ggpy
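A minimal pyplot sketch, assuming made-up yearly sales figures. The non-interactive "Agg" backend and the in-memory buffer are choices made here so the example runs without a display; in a notebook you would normally call plt.show() instead:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
years = [2019, 2020, 2021]
sales = [120, 95, 140]

fig, ax = plt.subplots()
ax.plot(years, sales, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Sales")
ax.set_title("Sales per year")

# Write the figure as PNG into memory; pass a file name to save to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```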
Import conventions
• To use the above modules, the analytics and
data science community has adopted the
following convention for consistency:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sub_module_name
Understanding data
• Examine the data as a whole.
• Understand the contextual meaning of the
columns and rows.
• Check the total number of records.
• Check the format and unit of the data.
• Look out for inconsistent formatting.
• Decide how to handle missing data.
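The checks above can be sketched with a few pandas calls. The table below is invented for the example and is not the seminar data set:

```python
import pandas as pd

# Hypothetical table with one missing value and mixed units.
df = pd.DataFrame({
    "city": ["Ipoh", "Kuching", None],
    "distance": ["12km", "850m", "3km"],
})

n_records = len(df)                   # total number of records
column_types = df.dtypes              # storage format of each column
missing_per_column = df.isna().sum()  # count of missing values per column

print(df.head())  # examine the first rows as a whole
print(n_records)
print(column_types)
print(missing_per_column)
```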
Example
• What are the problems with the data in this
CSV file?
Problems
1. Missing headers
2. Duplicated records
3. Missing values
4. Inconsistent units
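For problem 2, one common approach (an assumption here, not necessarily the fix used in the seminar) is to count the repeated rows with duplicated() and remove them with drop_duplicates(). The table is made up:

```python
import pandas as pd

# Hypothetical table in which the second record appears twice.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": [10, 20, 20, 30],
})

# duplicated() flags each row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first copy; reset_index renumbers the rows.
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```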
Fix #1 explained
• The code starts by importing pandas, the key module for data
cleaning, and uses pd as an alias for pandas.
• The header is defined as a list holding multiple strings.
• Next, the read_csv() function is called through the pd alias.
Pass in the CSV file name, together with the previously defined
header as the names argument.
• pd.read_csv() returns an important structure: a DataFrame.
• The last line of code displays the first 5 records of the DataFrame.
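The steps described above can be sketched as follows. The original code and CSV file are not reproduced in these notes, so the column names and the in-memory CSV text are assumptions made to keep the example self-contained; with a real file you would pass its name to read_csv() instead of the StringIO object:

```python
from io import StringIO

import pandas as pd

# Stand-in for a CSV file that has no header row (hypothetical content).
csv_text = "Ann,23,1.6\nBen,35,1.8\nCho,41,1.7\n"

# Header defined as a list of strings (hypothetical column names).
header = ["name", "age", "height"]

# names= supplies the column headers; the first line is then treated as data.
df = pd.read_csv(StringIO(csv_text), names=header)

# Display the first 5 records (here there are only 3).
print(df.head())
```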
Dataframe
• A DataFrame stores tabular (2D) data in a single
variable.
• Each row corresponds to an observation; each column
corresponds to a variable.
• Features of a DataFrame:
– Mutable size: rows and columns can be added or removed.
– Each data point is identifiable by its row index and
column header name.
– Mathematical operations can be performed on rows or columns.
• Constructor of a DataFrame:
– pandas.DataFrame(data, index, columns, dtype, copy)
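A short sketch of the constructor and the three features listed above, using invented row labels, column names and values:

```python
import pandas as pd

# Build a DataFrame from data, row labels (index) and column headers.
df = pd.DataFrame(
    data=[[2, 3.5], [4, 1.0], [1, 8.0]],
    index=["r1", "r2", "r3"],
    columns=["quantity", "unit_price"],
)

# Identify one data point by row label and column header.
price_r2 = df.loc["r2", "unit_price"]

# Arithmetic on whole columns; storing the result adds a column (size grows).
df["total"] = df["quantity"] * df["unit_price"]

# Remove a row (size shrinks); drop() returns a new DataFrame.
smaller = df.drop(index="r3")

print(df)
print(smaller)
```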
Missing Value
Original: missing data
DataFrame: sentinels like NaN inserted for the missing values
Missing value
• Fill in the missing values with acquisition_filled using the fillna() function.
• The same approach can fill them with the mean, by calling mean() on
a column with numeric values.
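Both options can be sketched on a made-up numeric column. The names df, price and fill_value are hypothetical; fill_value simply stands in for whatever replacement value is chosen:

```python
import numpy as np
import pandas as pd

# Hypothetical table with two missing prices.
df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "price": [10.0, np.nan, 30.0, np.nan],
})

# Option 1: fill with a fixed replacement value (hypothetical choice).
fill_value = 0.0
filled_constant = df["price"].fillna(fill_value)

# Option 2: fill with the column mean; mean() skips NaN by default.
filled_mean = df["price"].fillna(df["price"].mean())

print(filled_constant.tolist())
print(filled_mean.tolist())
```

fillna() returns a new Series here, so df itself still contains the NaN values until the result is assigned back.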
Missing value
• Other handling methods can be used accordingly:
– Fill missing values with an appropriate statistical value,
e.g. median, mean or mode.
– Predict missing values using a predictive model or
algorithm.
– Drop the record:
• df.dropna()
• df.dropna(axis=0, how='any', thresh=None, subset=None,
inplace=False)
– https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
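A small sketch of how the how and subset parameters of dropna() change which records are dropped, using an invented two-column table:

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 0 is complete, row 3 is entirely missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
})

# Default (how='any'): drop every row that has at least one NaN.
any_dropped = df.dropna()

# how='all': drop only rows in which every value is NaN.
all_dropped = df.dropna(how="all")

# subset: only look at column "a" when deciding what to drop.
subset_dropped = df.dropna(subset=["a"])

print(any_dropped)
print(all_dropped)
print(subset_dropped)
```

With inplace=False (the default), df is left unchanged and each call returns a new DataFrame.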
• For inconsistent units, we can write code to remove the unit suffix,
such as "m", and convert every value to the same base unit by
multiplying. This involves searching for the affected values and
then converting them.
• Some useful commands:
– if "m" in string_variable:  # True if "m" occurs anywhere in string_variable
– Use float() or int() to convert a text value to float or int.
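The idea can be sketched as a small helper. The function name to_base_unit is hypothetical, and the meaning of the suffix is an assumption: here "m" is taken to mean millions, so the multiplier is 1,000,000. Adjust the multiplier to whatever the unit means in the actual data:

```python
def to_base_unit(text):
    """Convert strings such as '2.5m' or '300000' to a float in base units."""
    cleaned = text.strip().lower()
    if "m" in cleaned:
        # Assumption: the suffix 'm' means millions.
        return float(cleaned.replace("m", "")) * 1_000_000
    # No suffix: the value is already in base units.
    return float(cleaned)


# Hypothetical raw values with inconsistent units.
raw_values = ["2.5m", "300000", "12M"]
converted = [to_base_unit(value) for value in raw_values]

print(converted)
```

On a DataFrame, the same helper could be applied to a text column with df["column_name"].apply(to_base_unit), where column_name is a placeholder.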