

Seminar 8:
Data Preparation

Learning Objectives
In this seminar, you will learn about:
• Dataframe
• Data wrangling
• Data cleaning


How to make sense out of messy data?

Picture from https://radiobruxelleslibera.files.wordpress.com/2014/04/111030-retention.png



Data
• Data are created from many varying sources.
• They may be collected automatically or manually.
• Data may be:
– Redundant
– Inconsistent
– Inaccurate


Data Management

[Data lifecycle diagram: Create → Collect → Process/Transform → Analyze → Derive Decision → Preserve. This seminar focuses on the Process/Transform stage ("We are here").]

Essential python modules


• Python can readily perform data transformation and conversion with the right use of modules.
• Several modules are central to data management:
– NumPy
– pandas
– matplotlib
– SciPy
• Anaconda comes with all these installed.


Numpy module
• NumPy stands for Numerical Python.
• Essential package for data computation.
• Introduces NumPy arrays for compact storage and faster reading and writing operations.
• Mainly used for data manipulation.
• It is the foundational library on which SciPy, scikit-learn and others are built.
• A must-know library for data science (a short example follows).
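The following is a minimal, illustrative sketch (not from the slides) of NumPy arrays and vectorised arithmetic; the values are made up.

import numpy as np

heights_m = np.array([1.75, 1.62, 1.80])   # 1-D NumPy array of floats
heights_cm = heights_m * 100               # vectorised: no explicit loop needed
print(heights_cm)                          # [175. 162. 180.]
print(heights_cm.mean())                   # average of the array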

Pandas module
• A powerful and flexible open-source library for data analysis.
• Provides a rich set of data structures for working with structured data.
• The primary object we will be using is the DataFrame.
• A DataFrame is a two-dimensional, table-like structure organised into columns with headers, where each row number corresponds to a record.
• The other data structure in pandas is the Series, for one-dimensional data.


Scipy module
• For additional data processing.
• High-level numerical processing and optimization routines.
• SciPy contains several useful subpackages, including:
– cluster
– linalg (linear algebra)
– stats
(A short example follows this list.)
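A brief illustrative sketch (not from the slides) of the stats subpackage; the sample values are made up.

from scipy import stats

sample = [2.3, 4.1, 3.8, 5.0, 4.4]
print(stats.describe(sample))   # count, min/max, mean, variance, skewness, kurtosis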

Matplotlib module
• The most popular Python library for visualization.
• It is widely adopted because it supports many operating systems and output types.
• Supports interactive plotting and data exploration.
• pyplot, in the matplotlib package, is the main interface for plotting (see the example below).
• Other modules can wrap around matplotlib to produce more powerful visualization:
– seaborn
– ggpy
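A minimal pyplot sketch (not from the slides); the data values are made up.

import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021]
sales = [120, 150, 90, 180]

plt.plot(years, sales, marker='o')   # simple line plot with point markers
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Example line plot')
plt.show()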


Import conventions
• To use the above modules, the analytics and
data science community has adopted the
following convention for consistency:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sub_module_name

Understanding data
• Examine the data as a whole.
• Understand the contextual meaning of the columns and rows.
• Check the total number of records.
• Check the format and unit of the data.
• Look out for inconsistent formatting.
• Decide how missing data will be handled (a quick inspection sketch follows).
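A quick inspection sketch using pandas; the file name data.csv and its contents are hypothetical.

import pandas as pd

df = pd.read_csv('data.csv')   # hypothetical file name

print(df.shape)        # total number of records and columns
print(df.dtypes)       # format (data type) of each column
print(df.head())       # first five records
print(df.isna().sum()) # count of missing values per column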


Example
• What are the problems with these data from the CSV file?


Problems
1. Missing headers
2. Duplicated record
3. Missing values
4. Inconsistent unit



Data Cleaning (Fix #1)


• The pandas module is good at cleaning up data.
• To read in a CSV data file, pandas provides the read_csv() function.
• We can first target the missing headers and add them in as follows:


Fix #1 explained

• The code starts by importing the pandas module for data cleaning, using pd as an alias that points to pandas.
• The header is defined as a list holding multiple strings.
• Next, the read_csv() function is called through the pd alias, passing in the name of the CSV file together with the previously defined header as the names argument.
• pd.read_csv() returns an important structure -> a DataFrame.
• The last line of code displays the first 5 records of the DataFrame (a hedged sketch of this code is shown below).
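A hedged sketch of the code the slide describes; the file name acquisitions.csv and the column names are hypothetical placeholders, not the actual seminar data.

import pandas as pd

# Hypothetical column names for a CSV file that has no header row.
header = ['acquiree', 'acquirer', 'price', 'date']

# header=None tells pandas the file has no header row; names supplies our own.
df = pd.read_csv('acquisitions.csv', header=None, names=header)

print(df.head())   # display the first 5 records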


Dataframe
• A DataFrame stores tabular (2D) data in a single variable.
• Each row corresponds to an observation; each column corresponds to a variable.
• Features of a DataFrame:
– Mutable size: data can be added or removed.
– Each data point is identifiable by its row index number and header name.
– Mathematical operations can be done on rows or columns.
• Constructor of a DataFrame (an example follows):
– pandas.DataFrame(data, index, columns, dtype, copy)
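A minimal sketch of constructing a DataFrame from a dictionary; the column names and values are made up.

import pandas as pd

data = {
    'acquiree': ['AlphaCo', 'BetaInc'],
    'price':    [120.0, 85.5],
}

df = pd.DataFrame(data, index=[0, 1], columns=['acquiree', 'price'])
print(df)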

Dataframe useful row operations


• Rows are accessed through the index. Get a specific row, or a range of rows (slicing), using loc[index]:
– e.g. df.loc[0] gets the first row; df.loc[1:3] gets the second to fourth rows.
• df.drop(number) # drop the row with that index
• df.append(another_df) # insert another dataframe into df (see the sketch below)
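A minimal sketch of these row operations with made-up data. Note that DataFrame.append has been removed in recent pandas releases, so pd.concat is shown as the current equivalent.

import pandas as pd

df = pd.DataFrame({'acquiree': ['AlphaCo', 'BetaInc', 'GammaLtd'],
                   'price': [120.0, 85.5, 40.0]})

print(df.loc[0])      # first row
print(df.loc[1:2])    # rows by index label (loc slicing is inclusive)

df = df.drop(2)       # drop the row with index label 2

extra = pd.DataFrame({'acquiree': ['DeltaPlc'], 'price': [60.0]})
# df = df.append(extra) in older pandas; pd.concat is the modern equivalent.
df = pd.concat([df, extra], ignore_index=True)
print(df)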



Dataframe useful column operations
• A column is identified by the column's name.
• We can first use df.columns to get all the columns' names.
• Two dataframes with the same structure can be combined with "+", which operates element-wise.
– Eg. df = df + df adds the dataframe to itself: numeric values are summed and string values are concatenated.
• Delete a column using pop(column_name).
– Eg. df.pop('acquiree') # remove the 'acquiree' column
(A short sketch of these operations follows.)
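A minimal sketch of these column operations with made-up data; the 'acquiree' column name follows the example used in the slides.

import pandas as pd

df = pd.DataFrame({'acquiree': ['AlphaCo', 'BetaInc'],
                   'price': [120.0, 85.5]})

print(df.columns)              # all column names

doubled = df + df              # element-wise: numbers summed, strings concatenated
print(doubled)

removed = df.pop('acquiree')   # removes the column and returns it as a Series
print(df.columns)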


Data Cleaning (Fix #2)


• Problem #2 -> Duplicated data (in the example, records 6 and 9 are duplicates).



Data Cleaning (Fix #2)


• df.duplicated() returns a Boolean value of True for each record that is a duplicate.
• Use drop_duplicates() from the dataframe to remove duplicates (the slide shows the data before and after; see also the sketch below).
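A minimal sketch of detecting and removing duplicate rows; the data are made up.

import pandas as pd

df = pd.DataFrame({'acquiree': ['AlphaCo', 'BetaInc', 'AlphaCo'],
                   'price': [120.0, 85.5, 120.0]})

print(df.duplicated())      # True for rows that repeat an earlier row
df = df.drop_duplicates()   # keep only the first occurrence of each row
print(df)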


Data Cleaning (Fix #3)


• Problem #3 -> Missing values
• Missing values are common; speak to the relevant person about how missing values should be resolved.
• Some ways to handle missing values:
– Removal: remove the records with missing values.
– Re-design data collection: ensure every field is filled, e.g. by giving users options to select from instead of allowing empty fields.
– Substitute an appropriate value.
– Use an appropriate replacement, for example the mean value or the value with the highest occurrence.



Missing Value
Original DataFrame with missing data; on loading, sentinels such as NaN are inserted for the missing values (the slide shows both views).


Missing value
• Fill in the missing values with acquisition_filled using the fillna() function.
• The same can be done to fill with the mean, by calling mean() on a numeric column (a sketch of both approaches follows).
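A hedged sketch of the two approaches above; the column name 'acquisition' and the placeholder fill value are assumptions, and acquisition_filled is simply the resulting column.

import pandas as pd
import numpy as np

df = pd.DataFrame({'acquisition': [120.0, np.nan, 85.5, np.nan]})

# Fill missing values with a fixed placeholder value.
acquisition_filled = df['acquisition'].fillna(0)

# Or fill missing values with the column mean instead.
acquisition_filled = df['acquisition'].fillna(df['acquisition'].mean())

df['acquisition'] = acquisition_filled
print(df)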



Missing value
• Other handling methods can be used as appropriate:
– Fill the missing value with an appropriate statistical value, e.g. median, mean, mode.
– Predict the missing value using a predictive model or algorithm.
– Dropping the record (see the sketch below):
• df.dropna()
• df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
– https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
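A brief sketch of dropping records with missing values; the column names and data are made up.

import pandas as pd
import numpy as np

df = pd.DataFrame({'acquiree': ['AlphaCo', None, 'GammaLtd'],
                   'price': [120.0, 85.5, np.nan]})

print(df.dropna())                   # drop rows containing any missing value
print(df.dropna(subset=['price']))   # drop rows only where 'price' is missing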


Data Cleaning (Fix #4)


• Problem #4 -> Inconsistent unit
• For an inconsistent unit, we can write code to remove the unit suffix such as "m" and convert to the same base unit by multiplying. This involves searching for the affected data and then converting it (a sketch follows this list).
• Some useful commands:
– if "m" in string_variable: # returns True if "m" exists in the string variable
– Use float() or int() to convert a text value to a float or an int.
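A hedged sketch of the unit clean-up described above, assuming a 'price' column where some values carry an 'm' suffix meaning millions; the column name and the conversion factor are assumptions.

import pandas as pd

df = pd.DataFrame({'price': ['120m', '85500000', '40m']})

def to_base_unit(value):
    # If the value contains 'm', strip it and multiply up to the base unit.
    if 'm' in value:
        return float(value.replace('m', '')) * 1_000_000
    return float(value)

df['price'] = df['price'].apply(to_base_unit)
print(df)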



You have learnt...


1. About several additional Python libraries useful for analytics.
2. How to pre-process, clean and fix problems in data.
3. How to use the attributes and functions of a DataFrame.


