Data Preprocessing - 241024 - 215531

Uploaded by

Alaa Shorbaji


Data Cleansing & Exploratory Data Analysis

Created by
Seaborn Kusto
Official Team

Malka Trianto Maulana Fauzi Rizki

Rusyd Haryo Ishaq Wardah Afrinal

Abdussalam Nugroho Siregar Ali

Table of Contents
Data Cleansing
Exploratory Data Analysis

Data Cleansing
Importing Libraries
We will start by importing the libraries we need for data cleansing: Pandas, NumPy, Matplotlib, and Seaborn.
Pandas is a Python library for data analysis.
NumPy is a Python library for mathematical operations.
Matplotlib is a Python library for data visualization.
Seaborn is a Python library for data visualization and exploratory data analysis.
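The import step can be sketched as follows. The aliases pd, np, plt, and sns are the usual community conventions and are an assumption here, since the slides do not show the exact import lines.

```python
# Conventional aliases (assumed; the original import lines are not shown).
import pandas as pd              # data analysis
import numpy as np               # mathematical operations
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns            # data visualization and exploratory data analysis
```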
Uploading Dataset
The next step after importing the libraries is uploading the dataset from its storage location to Google Colab.
The dataset we will use is the Titanic.csv dataset obtained from Kaggle.

Reading Dataset
After the dataset is uploaded, we read its contents into a DataFrame.
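A minimal sketch of the upload and read steps. In Google Colab the file is typically uploaded with files.upload() from the google.colab module, shown here only as a comment. To keep the example runnable anywhere, a few made-up rows in an in-memory string stand in for Titanic.csv; the variable names and the sample rows are assumptions.

```python
import io
import pandas as pd

# In Google Colab, the upload and read steps are typically:
#   from google.colab import files
#   uploaded = files.upload()
#   df = pd.read_csv("Titanic.csv")

# Hypothetical stand-in for Titanic.csv (made-up rows, not real records).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Embarked
1,0,3,male,22,S
2,1,1,female,38,C
3,1,3,female,,S
"""
df = pd.read_csv(io.StringIO(csv_text))  # an empty field is read as NaN
print(df)
```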
Displaying DataFrame Information
The info() method prints a summary of the DataFrame: the range index, the number of columns, column labels, column data types, memory usage, and the number of non-null values in each column. Note: the info() method prints this summary itself rather than returning it.

We have to check the state of the data in each variable/column before taking any action in data manipulation and data cleaning.
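As a small illustration, the sketch below calls info() on a made-up DataFrame with Titanic-style column names. The values are hypothetical. The count() call is added only to show the same non-null figures as a value we can inspect.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with Titanic-style columns (made-up values).
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "S", None],
})

df.info()  # prints the summary and returns None

# Non-null count per column, the same figure that info() reports.
non_null = df.count()
print(non_null)
```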
Column Age
The describe() method returns descriptive statistics for the data: count, mean, standard deviation, minimum, quartiles, and maximum. The value_counts() method returns the number of occurrences of each value. The plot() method draws a chart of the kind we specify.
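The three methods can be tried on a small made-up Age series, as in the sketch below. The values and the choice of a histogram are assumptions; the slides do not state which chart kind was used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical ages (made-up values), with one missing entry.
age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0], name="Age")

summary = age.describe()      # count, mean, std, min, quartiles, max
counts = age.value_counts()   # occurrences of each non-null value
ax = age.plot(kind="hist", bins=5, title="Age distribution")

print(summary)
print(counts.head())
```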
Column Age
Because the Age column is skewed, we impute its missing values with the median, and then we can check the data information again.
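A minimal sketch of median imputation on made-up ages. The sample values are hypothetical; skew() is used here only as one way to check asymmetry, which the slides may have judged from a plot instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (made-up values) with two missing ages.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 80.0, np.nan, 4.0]})

print(df["Age"].skew())          # sample skewness; far from 0 means asymmetric
age_median = df["Age"].median()  # NaN values are ignored by default
df["Age"] = df["Age"].fillna(age_median)

df.info()  # Age should now show no missing values
```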
Column Cabin
The dataset has 891 entries, yet the Cabin column has only 204 non-null values, so the other 687 entries must be NULL. We can show the proportion of values in the Cabin column.

The result shows that the Cabin column has too many unique values and is not informative enough to describe the survival data, so it is better to remove the Cabin column.
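The checks and the removal could look like the sketch below, run on a made-up five-row sample rather than the real dataset. The variable names are assumptions.

```python
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Cabin": [None, "C85", None, "E46", None],
})

missing_share = df["Cabin"].isna().mean()  # fraction of missing entries
n_unique = df["Cabin"].nunique()           # distinct non-null values
print(df["Cabin"].value_counts(normalize=True, dropna=False))

df = df.drop(columns=["Cabin"])            # remove the column
```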
Column Embarked
The dataset has 891 entries, yet the Embarked column has only 889 non-null values. We can show the proportion of values in the Embarked column, which is categorical data.

We will impute the missing values in the Embarked column, so we first check its data type. Embarked is categorical data, so the imputation uses the mode. From the proportions of the Embarked column, S appears the most, so S is the mode.
After all the data cleansing, we can check the data information again. There are no more missing values.
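Mode imputation for a categorical column can be sketched as follows. The sample values are made up, chosen so that S is the most frequent value as in the text.

```python
import pandas as pd

# Hypothetical sample (made-up values) with two missing entries.
df = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S", None]})

print(df["Embarked"].dtype)                          # check the data type first
print(df["Embarked"].value_counts(normalize=True))   # proportions of S, C, Q

embarked_mode = df["Embarked"].mode()[0]             # most frequent value
df["Embarked"] = df["Embarked"].fillna(embarked_mode)

print(df["Embarked"].isna().sum())                   # missing values left
```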
Data Manipulation
Column SibSp and Column Parch
Next we will do data manipulation. Manipulation here does not mean changing the data values, but making the data easier for a machine to read. The SibSp (sibling/spouse) column states the number of siblings or spouses who came with the passenger. The Parch (parent/children) column states the number of parents or children who came with the passenger.
We will make a new column that shows whether the passenger is alone or travelling with family.
Then we can show the new data.
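One way to build such a column is sketched below. The column name Alone and the 1/0 encoding are assumptions, since the slides do not show the name or values actually used.

```python
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
})

# "Alone" is a hypothetical column name: 1 if no relatives aboard, else 0.
df["Alone"] = ((df["SibSp"] + df["Parch"]) == 0).astype(int)
print(df)
```

Only the second passenger in this sample has neither siblings/spouses nor parents/children aboard, so only that row is marked as alone.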
Data Visualization
Relation between the Sex column and the Survived column
Let's look at the proportion of each sex among the passengers who survived and compare it with the proportion among those who did not survive. Then we can visualize the Sex column for survivors and non-survivors.
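A sketch of this comparison using a cross-tabulation on made-up data. Using pd.crosstab is an assumption; the slides may have filtered the DataFrame and called value_counts() instead.

```python
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Share of each sex within the non-survivors (0) and the survivors (1).
by_outcome = pd.crosstab(df["Survived"], df["Sex"], normalize="index")
print(by_outcome)
```

Calling by_outcome.plot(kind="bar") on the resulting table is one way to draw the comparison as a bar chart.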
Exploratory Data Analysis
Exploratory Data Analysis, or EDA, is an important step in any Data Analysis or Data Science project. EDA is the process of investigating the dataset to discover patterns and anomalies (outliers), and to form hypotheses based on our understanding of the dataset.

EDA involves generating summary statistics for numerical data in the dataset and creating various graphical representations to understand the data better.
Before we delve into EDA, it is important to first get a sense of where EDA fits in the whole data
science process.
Importing Libraries
We will start by importing the libraries we need for performing EDA: Pandas, NumPy, Matplotlib, and Seaborn.
Pandas is a Python library for data analysis.
NumPy is a Python library for mathematical operations.
Matplotlib is a Python library for data visualization.
Seaborn is a Python library for data visualization and exploratory data analysis.
Uploading Dataset
The next step after importing the libraries is uploading the dataset from its storage location to Google Colab.
The dataset we will use is the Titanic.csv dataset obtained from Kaggle.

Reading Dataset
After the dataset is uploaded, we read its contents into a DataFrame.
Displaying Top 5 Rows
Return the first 5 rows of the dataset to see what the data looks like.
Changing Index
The default Pandas index starts from 0, while the PassengerId column of the dataset starts from 1, so we will use the PassengerId column as the index.
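Displaying the first rows and switching the index can be sketched as follows, on a made-up six-row sample.

```python
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
})

print(df.head())                  # first 5 rows; the default index starts at 0
df = df.set_index("PassengerId")  # the index now starts at 1
print(df.head())
```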
Displaying DataFrame Information
The info() method prints a summary of the DataFrame: the range index, the number of columns, column labels, column data types, memory usage, and the number of non-null values in each column. Note: the info() method prints this summary itself rather than returning it.
Checking Missing Value (NaN)
Check whether there are missing values (NaN) and count the missing values in each column of the dataset.
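A minimal sketch of both checks on made-up data. The variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

has_missing = bool(df.isna().any().any())  # True if any cell is missing
missing_per_column = df.isna().sum()       # missing count for each column
print(has_missing)
print(missing_per_column)
```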
Displaying Descriptive Statistics
Look at the descriptive statistics for the dataset: count, mean, standard deviation, minimum and maximum, and quartiles (25%, 50%, and 75%).
Displaying Unique Values in a Column
Display all the unique values in a column, along with their data type.
Displaying Proportion of Unique Values
Display the proportion of each unique value in a categorical column.
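Both steps can be sketched on a small made-up Embarked column.

```python
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S"]})

unique_values = df["Embarked"].unique()                     # in order of first appearance
proportions = df["Embarked"].value_counts(normalize=True)   # share of each value

print(unique_values, df["Embarked"].dtype)
print(proportions)
```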
Displaying Shape (Number of Rows and Columns)
Display the number of rows and columns in the dataset.

Checking Duplicate Data

Check the number of duplicate records in the dataset.
Removing Duplicate Data
Remove all duplicate data from the dataset.
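A sketch of counting and removing duplicates on made-up data. It assumes that a duplicate means a row whose values all match an earlier row, which is how duplicated() and drop_duplicates() behave by default.

```python
import pandas as pd

# Hypothetical sample (made-up values); the third row repeats the first.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "C"],
    "Age": [22, 38, 22, 26],
})

n_duplicates = df.duplicated().sum()  # rows identical to an earlier row
df = df.drop_duplicates()             # keeps the first occurrence
print(n_duplicates, df.shape)
```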
Embarked Column
The Embarked column has 2 null values, at PassengerId numbers 62 and 830.
Embarked Column
Embarked is categorical data, so we can use the mode to impute the missing data in the Embarked column.

Proportion of Embarked: before imputation and after imputation
Embarked Column
Change the object data in Embarked ('S', 'C', 'Q') to numerical data (0, 1, 2).
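The conversion can be sketched with a dictionary and map(). The pairing S to 0, C to 1, Q to 2 follows the order given in the text; the use of map() itself is an assumption about how the slides did it.

```python
import pandas as pd

# Hypothetical sample (made-up values), already free of missing entries.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

embarked_codes = {"S": 0, "C": 1, "Q": 2}
df["Embarked"] = df["Embarked"].map(embarked_codes)
print(df)
```

Values that are not keys of the dictionary become NaN under map(), so the missing values should be imputed before this step.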
Age Column
The Titanic data has 891 rows, but the Age column has only 714 non-null values, which means the Age column has 177 missing values.
Age Column
The Age column has a skewed distribution, so we can use the median to impute the missing data.
Age Column
Visualization of the Age column

In this plot we can see the outliers. In this case the outliers are passengers aged roughly 0 to 5 and older than roughly 55.
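If the plot is a box plot, which is an assumption since the chart itself is not reproduced here, points are drawn as outliers when they fall more than 1.5 times the interquartile range (IQR) beyond the quartiles, the default whisker rule in Matplotlib. The sketch below applies that rule to made-up ages.

```python
import pandas as pd

# Hypothetical ages (made-up values).
age = pd.Series([1, 22, 24, 26, 28, 28, 29, 30, 32, 35, 71], name="Age", dtype=float)

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits

outliers = age[(age < lower) | (age > upper)]
print(lower, upper)
print(outliers.tolist())
```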
Sex Column
The Sex column has only two unique values, male and female. We want to convert this object data to numerical data:

0 for male and 1 for female

Drop Data
Dropping is used when the existing data is uninformative and has a lot of unique values. In this case we drop the Cabin, Name, and Ticket columns.
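Both steps, encoding Sex and dropping the three columns, can be sketched together on a made-up two-row sample. The names and ticket strings are placeholders, not real records.

```python
import pandas as pd

# Hypothetical sample (placeholder values).
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B"],
    "Sex": ["male", "female"],
    "Ticket": ["T-001", "T-002"],
    "Cabin": [None, "C85"],
    "Survived": [0, 1],
})

df["Sex"] = df["Sex"].map({"male": 0, "female": 1})  # 0 for male, 1 for female
df = df.drop(columns=["Cabin", "Name", "Ticket"])    # remove uninformative columns
print(df)
```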
Data Survived Visualization
Proportion of Survived data
Data Survived Visualization
Making data frame from Survived data
Data Survived Visualization
Show chart using matplotlib.pyplot module
Data Survived Visualization
Show chart using seaborn module
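The steps on these slides (proportion, DataFrame, chart) can be sketched as follows on made-up data. The column name Proportion and the bar chart are assumptions; the slides do not show which chart type was drawn. The Matplotlib version is executed, and a Seaborn equivalent is given as a comment.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample (made-up values).
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1, 0, 0]})

# Proportion of each outcome, then the same information as a DataFrame.
proportion = df["Survived"].value_counts(normalize=True)
survived_df = proportion.reset_index()
survived_df.columns = ["Survived", "Proportion"]  # set names explicitly
print(survived_df)

# Chart with matplotlib.pyplot.
fig, ax = plt.subplots()
ax.bar(survived_df["Survived"].astype(str), survived_df["Proportion"])
ax.set_xlabel("Survived")
ax.set_ylabel("Proportion")

# Seaborn equivalent (assumes seaborn is imported as sns):
#   sns.barplot(x="Survived", y="Proportion", data=survived_df)
```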
Thank You
Seaborn Kusto
