S08 Slides

Uploaded by

mathycheok

3/8/22

Seminar 8:
Data Preparation

Learning Objectives
In this lecture, you will learn about:
• Dataframes
• Data wrangling
• Data cleaning


How to make sense out of messy data?

Picture from https://radiobruxelleslibera.files.wordpress.com/2014/04/111030-retention.png

Data
• Data are created from varying sources.
• Automatically or manually collected.
• Data may be:
– Redundant
– Inconsistent
– Inaccurate


Data Management

[Diagram: data life cycle] Create -> Collect -> Process / Transform (we are here) -> Analyze -> Derive Decision -> Preserve

Essential Python modules

• Python can readily transform and convert data
with the right modules.
• Several modules are central to data management:
– NumPy
– pandas
– Matplotlib
– SciPy
• Anaconda comes with all of these installed.

NumPy module
• NumPy stands for Numerical Python.
• Essential package for numerical computation.
• Introduces NumPy arrays, which store data
compactly and make reading and writing
operations faster than with Python lists.
• Mainly used for data manipulation.
• It is the foundational library on which SciPy,
scikit-learn and others are based.
• Must-know library for data science.
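
A minimal sketch of what a NumPy array offers over a plain Python list. The values are made up for illustration and are not from the seminar data.

```python
import numpy as np

# Hypothetical values, for illustration only.
prices = np.array([1.5, 2.0, 3.5])

# Arithmetic applies to every element at once, without an explicit loop.
doubled = prices * 2

# Every element shares one data type, which keeps the storage compact.
print(prices.dtype)    # float64
print(doubled)         # [3. 4. 7.]
print(prices.mean())
```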

pandas module
• A powerful and flexible open-source library for data
analysis.
• Provides a rich set of data structures for working with
structured data.
• The primary object that we will be using is the DataFrame
object.
• A DataFrame is a two-dimensional, table-like structure
organised into columns with headers; each row number
corresponds to one record.
• The other data structure in pandas is the Series, for
one-dimensional data.
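
The two structures can be sketched as follows. The names and ages are invented for illustration; the point is only that a DataFrame is a table and each of its columns is a Series.

```python
import pandas as pd

# A Series holds 1-dimensional data with an index.
ages = pd.Series([25, 32, 47], name="age")

# A DataFrame holds 2-dimensional data: named columns and numbered rows.
people = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal"],   # made-up records
    "age": [25, 32, 47],
})

# Selecting a single column of a DataFrame gives back a Series.
age_column = people["age"]
print(type(age_column).__name__)   # Series
print(people.shape)                # (3, 2)
```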

SciPy module
• For additional data processing.
• High-level numerical processing and
optimization.
• SciPy contains several useful subpackages:
– cluster
– linalg (linear algebra)
– stats

Matplotlib module
• The most popular Python library for visualization.
• It is widely adopted because it supports
different operating systems and output types.
• Supports interactive plotting and data exploration.
• pyplot in the matplotlib package is the main interface for
plotting.
• Other modules wrap around
matplotlib to produce more powerful
visualizations:
– seaborn
– ggpy

Import conventions
• To use the above modules, the analytics and
data science community has adopted the
following convention for consistency:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sub_module_name  # placeholder name, e.g. from scipy import stats

Understanding data
• Examine the data as a whole.
• Understand the contextual meaning of the
columns and rows.
• Check the total number of records.
• Check the format and unit of the data.
• Look out for inconsistent formatting.
• Check for missing data.
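
The checks above could be sketched in pandas as below. The dataframe is a made-up stand-in for a freshly loaded file, and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up records standing in for a freshly loaded data file.
df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "price": ["10m", "2500000", np.nan, "7m"],   # mixed formats, one gap
})

print(df.head())        # examine the first records
print(df.shape)         # (number of records, number of columns)
print(df.dtypes)        # the data type held in each column
print(df.isna().sum())  # count of missing values per column
```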

Example
• What are the problems with this data
from a CSV file?

Problems
1. Missing headers
2. Duplicated records
3. Missing values
4. Inconsistent units


Data Cleaning (Fix #1)

• The pandas module is good at cleaning up
data.
• To read in a CSV data file, pandas
provides the read_csv() function.
• We can first target the missing headers
by supplying them when the file is read.

Fix #1 explained

• The code starts by importing the pandas module used for data
cleaning, with pd as an alias for pandas.
• The header is defined as a list holding multiple strings.
• Next, the read_csv() function is called from pandas through the pd
alias. Pass in the file name of the CSV file together with the header
defined previously, which is set as the names of the columns.
• pd.read_csv() returns an important structure: a dataframe.
• The last line of code displays the first 5 records of the dataframe.
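
The slide's code is not reproduced in this text, so the following is only a sketch of the steps described above. The header names other than acquiree and the two records are assumptions, and an in-memory string stands in for the CSV file so that the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for a CSV file that has no header row (made-up records).
csv_text = "Alpha,Zed,2010,10m\nBeta,Yon,2012,2500000\n"

# Hypothetical header names held in a list of strings.
header = ["acquiree", "acquirer", "year", "price"]

# names= supplies the column headers; because names is given,
# read_csv treats the first line of the file as data, not as a header.
df = pd.read_csv(io.StringIO(csv_text), names=header)

# head() displays the first 5 records (fewer if the data is shorter).
print(df.head())
```

With a real file, the file name string would be passed in place of io.StringIO(csv_text).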

Dataframe
• A dataframe stores tabular (2D) data in a single
variable.
• Each row corresponds to an observation, each column
corresponds to a variable.
• Features of a dataframe:
– Mutable size: rows and columns can be added and removed.
– Each data point is identifiable by row index number and
header name.
– Mathematical operations can be done on rows or columns.
• Constructor of a dataframe:
– pandas.DataFrame(data, index, columns, dtype, copy)
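
A small sketch of the constructor and the features listed above, using made-up numbers. The column names a, b and total are chosen only for illustration.

```python
import pandas as pd

# data, index and columns are the constructor arguments named above.
df = pd.DataFrame(
    data=[[1, 2], [3, 4], [5, 6]],   # made-up values
    index=[0, 1, 2],                 # row index numbers
    columns=["a", "b"],              # header names
    dtype=float,
)

# A data point is identified by its row index and header name.
print(df.loc[1, "b"])        # 4.0

# Mathematical operations work down a column or across a row.
print(df["a"].sum())         # 9.0
print(df.loc[0].sum())       # 3.0

# Mutable size: a column can be added after construction.
df["total"] = df["a"] + df["b"]
print(df.shape)              # (3, 3)
```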

Dataframe useful row operations

• A row is accessed through its index.
Get a specific row or rows (slicing) using
loc[index]. With the default integer index:
– df.loc[0] gets the first row
– df.loc[1:3] gets the second to fourth rows

• df.drop(number) # drop the row with that index label

• pd.concat([df, another_df]) # append another dataframe to df; the older df.append(another_df) was removed in pandas 2.0
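
The row operations can be tried on a small made-up dataframe, as sketched here. It uses pd.concat() for appending, since DataFrame.append() is no longer available from pandas 2.0 onwards.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C", "D", "E"],   # made-up records
                   "value": [10, 20, 30, 40, 50]})

first_row = df.loc[0]           # first row, returned as a Series
second_to_fourth = df.loc[1:3]  # loc slices include the end label

# drop() returns a new dataframe; df itself is unchanged.
without_row_2 = df.drop(2)

# pd.concat() appends the rows of another dataframe.
more = pd.DataFrame({"name": ["F"], "value": [60]})
combined = pd.concat([df, more], ignore_index=True)

print(len(second_to_fourth))    # 3
print(len(without_row_2))       # 4
print(len(combined))            # 6
```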


Dataframe useful column operations
• A column is identified by its name.
• We can first use df.columns to get all the column names.

• A column is added by assigning to a new column name, e.g. df['new_column'] = values.
• The "+" operator does not add a column: it adds two dataframes with the same structure element by element.
– E.g. df = df + df adds the dataframe to itself; numbers are summed and strings are concatenated.
• pop(column_name) removes a column and returns it.
– E.g. df.pop('acquiree') # remove the first column
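
A short sketch of the column operations on made-up records. Only the column name acquiree comes from the slides; price and price_doubled are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"acquiree": ["Alpha", "Beta"],   # made-up records
                   "price": [10, 25]})

print(list(df.columns))            # ['acquiree', 'price']

# Assigning to a new name adds a column.
df["price_doubled"] = df["price"] * 2

# "+" between two dataframes works element by element:
# numbers are summed and strings are concatenated.
summed = df + df
print(summed.loc[0, "acquiree"])   # AlphaAlpha
print(summed.loc[0, "price"])      # 20

# pop() removes the column from df and returns it as a Series.
removed = df.pop("acquiree")
print(list(df.columns))            # ['price', 'price_doubled']
```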

Data Cleaning (Fix #2)

• Problem #2 -> Duplicated data
• Records 6 and 9 are duplicates.


Data Cleaning (Fix #2)

• df.duplicated() returns a Boolean value of True for each record that is a duplicate of an earlier record.
• Use drop_duplicates() on the dataframe to remove duplicates.
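
A minimal sketch of both calls on three made-up records, where the third record repeats the first.

```python
import pandas as pd

df = pd.DataFrame({"acquiree": ["Alpha", "Beta", "Alpha"],   # made-up records
                   "price": [10, 25, 10]})

# True marks a record that repeats an earlier one; the first copy is False.
print(df.duplicated().tolist())   # [False, False, True]

# drop_duplicates() returns a new dataframe with the repeats removed.
deduplicated = df.drop_duplicates()
print(len(deduplicated))          # 2
```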

Data Cleaning (Fix #3)

• Problem #3 -> Missing values
• Missing values are common; speak to the relevant person
about how they should be resolved.
• Some ways to handle missing values:
– Removal: remove the records with missing values.
– Re-design data collection: ensure every field is filled,
e.g. by giving users options to select instead of allowing
an empty field.
– Substitute an appropriate value, for example the mean
or the most frequently occurring value.


Missing Value
• Original: data with missing values.
• Dataframe: pandas inserts sentinels like NaN for the missing values.

Missing value
• Fill in the missing value with acquisition_filled using the fillna() function.

• The same can be done to fill it with the mean, by calling mean() on
a column with numeric values.
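
Both fills might look like the sketch below. It assumes acquisition_filled is used as a placeholder string, and the records and the price column are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"acquiree": ["Alpha", None, "Gamma"],   # made-up records
                   "price": [10.0, np.nan, 30.0]})

# Replace a missing text value with a fixed placeholder string.
df["acquiree"] = df["acquiree"].fillna("acquisition_filled")

# Replace a missing number with the mean of its column.
# mean() skips NaN by default, so this is (10 + 30) / 2.
df["price"] = df["price"].fillna(df["price"].mean())

print(df["acquiree"].tolist())   # ['Alpha', 'acquisition_filled', 'Gamma']
print(df["price"].tolist())      # [10.0, 20.0, 30.0]
```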


Missing value
• Other handling methods can be used as appropriate:
– Fill the missing value with an appropriate statistical value,
e.g. median, mean or mode.
– Predict the missing value using a predictive model or
algorithm.
– Drop the record:
• df.dropna()
• df.dropna(axis=0, how='any', thresh=None, subset=None,
inplace=False)
– https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
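
The main dropna() parameters can be compared on a small made-up dataframe, as sketched here. Each call returns a new dataframe and leaves df unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],   # made-up values
                   "b": [1.0, 2.0, np.nan, np.nan]})

any_missing = df.dropna()                 # drop rows with at least one NaN
all_missing = df.dropna(how="all")        # drop rows where every value is NaN
only_col_a = df.dropna(subset=["a"])      # look at column "a" only
at_least_1 = df.dropna(thresh=1)          # keep rows with >= 1 non-missing value

print(len(any_missing), len(all_missing), len(only_col_a), len(at_least_1))
# 1 3 2 3
```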

Data Cleaning (Fix #4)

• Inconsistent units

• For inconsistent units, we can write code to remove a unit suffix like "m"
and convert the value to the same base unit by multiplying. This involves
searching for the affected data and then converting it.
• Some useful commands:
– if "m" in string_variable: # True if "m" exists in the string variable.
– Use float() or int() to convert a text value to float or int.
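
One possible way to combine these commands is sketched below. It assumes that "m" stands for million and that the values sit in a text column named price; both are assumptions, since the slides do not state the unit or show the data. The helper to_number is hypothetical.

```python
import pandas as pd

def to_number(text):
    """Convert a text value such as "10m" or "2500000" to a float.

    Assumption: a trailing "m" means million.
    """
    text = str(text).strip().lower()
    if "m" in text:
        # Remove the unit, then multiply up to the base unit.
        return float(text.replace("m", "")) * 1_000_000
    return float(text)

df = pd.DataFrame({"price": ["10m", "2500000", "1.5m"]})   # made-up values
df["price"] = df["price"].apply(to_number)
print(df["price"].tolist())   # [10000000.0, 2500000.0, 1500000.0]
```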


You have learnt...

1. About several additional Python libraries useful for analytics.
2. How to pre-process, clean and fix problems in data.
3. How to use the attributes and functions of a dataframe.
