Data Preprocessing - 241024 - 215531
& Exploratory Data Analysis
Created by
Seaborn Kusto
Official Team
Reading Dataset
After the dataset is uploaded, we read its contents.
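As a minimal sketch of this step, the dataset can be read with pandas `read_csv`. The CSV text below is a small made-up sample standing in for the uploaded file, and the column subset is an assumption for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for the uploaded file; with a real file you would
# pass its path to pd.read_csv instead of a StringIO buffer.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked
1,0,3,male,22,1,0,S
2,1,1,female,38,1,0,C
3,1,3,female,26,0,0,S
4,0,3,male,,0,0,Q
"""

# Read the contents into a DataFrame; the empty Age field becomes NaN.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```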
Displaying DataFrame Information
The info() method prints information about the DataFrame. The information
contains the number of columns, column labels, column data types,
memory usage, range index, and the number of non-null cells in each column.
Note: info() prints this summary directly instead of returning it.
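A short sketch of calling info() is shown here. The DataFrame is a tiny made-up sample, not the real dataset.

```python
import pandas as pd

# Tiny illustrative sample (values are made up).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# info() writes the summary to standard output and returns None,
# so there is no need to wrap it in print().
result = df.info()
```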
We will impute the missing values in the Embarked column, so we first check its data type.
The Embarked column is categorical, so we impute with the mode. From the
proportions of the Embarked column, S appears most often, so S is the mode.
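The mode imputation described here could look roughly like the following. The sample values are made up; only the column name Embarked and the categories S, C, Q come from the text.

```python
import pandas as pd

# Illustrative sample with two missing Embarked values (made-up data).
df = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S", None]})

# Check the data type and the proportion of each category.
print(df["Embarked"].dtype)
print(df["Embarked"].value_counts(normalize=True))

# mode() returns a Series of the most frequent value(s); take the first one.
mode_value = df["Embarked"].mode()[0]

# Fill the missing entries with the mode.
df["Embarked"] = df["Embarked"].fillna(mode_value)
```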
After all data cleansing, we can check the DataFrame information again. There are no more missing values.
Data Manipulation
Column SibSp and Column Parch
We will do data manipulation. Manipulation here doesn't mean changing the data values, but making
the data easier for a machine to read. The SibSp (Sibling/Spouse) column states the number of
siblings or spouses who came with the passenger. The Parch (Parent/Children) column states
the number of parents or children who came with the passenger.
We will make a new column that shows whether the
passenger is alone or traveling with family.
Then we can display the new data.
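One possible way to derive such a column is sketched below. The column name `Alone` and the 1/0 encoding are assumptions for illustration; the source does not name the new column.

```python
import pandas as pd

# Made-up sample rows; SibSp and Parch are the columns described above.
df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# A passenger is treated as alone when they have no siblings/spouses
# and no parents/children aboard. "Alone" is a hypothetical column name.
df["Alone"] = ((df["SibSp"] + df["Parch"]) == 0).astype(int)

print(df)
```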
Data Visualization
Relation between Sex column and Survived column
Let's look at the proportion of each sex among the passengers who survived and compare it with
the proportion among those who did not. We can then visualize the Sex column for survivors and non-survivors.
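A rough sketch of this comparison uses `pd.crosstab` with row-wise normalization. The sample rows are made up and do not reflect the real dataset's proportions.

```python
import pandas as pd

# Made-up sample; 1 = survived, 0 = did not survive.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# normalize="index" makes each Survived row sum to 1, giving the share
# of each sex among survivors and among non-survivors.
proportions = pd.crosstab(df["Survived"], df["Sex"], normalize="index")
print(proportions)
```

If matplotlib is installed, `proportions.plot(kind="bar")` is one way to turn this table into a chart.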
Exploratory Data Analysis
Exploratory Data Analysis, or EDA, is an important step in
any Data Analysis or Data Science project. EDA is the
process of investigating the dataset to discover patterns
and anomalies (outliers), and to form hypotheses based on
our understanding of the dataset.
Displaying Top 5 Rows
Returns the top 5 rows of the dataset so we can see what it looks like.
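A minimal sketch with a made-up sample: `head()` returns the first five rows by default.

```python
import pandas as pd

# Made-up sample with eight rows.
df = pd.DataFrame({
    "PassengerId": range(1, 9),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
})

# head() with no argument returns the first 5 rows.
top_rows = df.head()
print(top_rows)
```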
Changing Index
The pandas default index starts from 0, while the PassengerId column of the dataset starts from 1. So
we will use the PassengerId column as the index.
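One way to do this is `set_index`, sketched here on a made-up sample.

```python
import pandas as pd

# Made-up sample; the default index here is 0, 1, 2.
df = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

# Use PassengerId as the index; it is removed from the regular columns.
df = df.set_index("PassengerId")
print(df)
```

The same result can be obtained at read time by passing `index_col="PassengerId"` to `pd.read_csv`.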
Checking Missing Value (NaN)
Checking whether there are missing values (NaN) and counting the number of missing values in
each column of the dataset.
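Both checks can be sketched with `isna()`. The sample data below is made up.

```python
import pandas as pd

# Made-up sample with missing values in two columns.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

# True if any cell in the whole DataFrame is missing.
has_missing = df.isna().any().any()

# Number of missing values in each column.
missing_per_column = df.isna().sum()

print(has_missing)
print(missing_per_column)
```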
Displaying Descriptive Statistics
Looking at descriptive statistics for the dataset: count, mean, standard deviation, minimum
and maximum, and quartiles (25%, 50% and 75%).
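These statistics correspond to what `describe()` reports for numeric columns. A sketch on made-up values:

```python
import pandas as pd

# Made-up numeric sample.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max
# for each numeric column.
stats = df.describe()
print(stats)
```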
Displaying Unique Values in a Column
Displaying all of the unique values in a column and their data type.
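A short sketch, using a made-up Embarked sample:

```python
import pandas as pd

# Made-up sample.
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# unique() returns the distinct values in order of first appearance.
values = df["Embarked"].unique()

print(values)
print(df["Embarked"].dtype)
```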
Displaying Proportion of Unique Values
Displays the proportion of each unique value for categorical data.
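A sketch using `value_counts(normalize=True)`, which returns fractions instead of raw counts. The sample is made up.

```python
import pandas as pd

# Made-up sample.
df = pd.DataFrame({"Embarked": ["S", "S", "C", "Q"]})

# normalize=True divides each count by the number of non-null values.
proportions = df["Embarked"].value_counts(normalize=True)
print(proportions)
```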
Displaying Shape (Number of Rows and Columns)
Displays the number of rows and columns in the dataset.
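The number of rows and columns named in the heading can be read from the `shape` attribute, as in this sketch on a made-up sample.

```python
import pandas as pd

# Made-up sample with 3 rows and 2 columns.
df = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 3]})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```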
Proportion of Embarked: Before Imputation vs. After Imputation
Embarked Column
Change the object data in Embarked ('S', 'C', 'Q') to numerical data (0, 1, 2)
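One way to apply this encoding is `Series.map` with a dictionary, sketched here on made-up rows. The mapping itself (S to 0, C to 1, Q to 2) follows the text above.

```python
import pandas as pd

# Made-up sample.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Replace each category with its numeric code.
embarked_codes = {"S": 0, "C": 1, "Q": 2}
df["Embarked"] = df["Embarked"].map(embarked_codes)
print(df)
```

Note that `map` turns any value missing from the dictionary into NaN, so this step fits best after the missing Embarked values have been imputed.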
Age Column
The Titanic data has 891 rows, but the Age column has only 714 non-null rows, which means the Age
column has 177 missing values.
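The arithmetic can be sketched as total rows minus non-null rows. The Series below only reproduces the counts quoted above (891 rows, 714 non-null); the age value 30.0 is a placeholder, not real data.

```python
import pandas as pd

# Placeholder Series matching the quoted counts: 714 filled, 177 missing.
age = pd.Series([30.0] * 714 + [float("nan")] * 177)

total_rows = age.size          # all rows, including NaN
non_null_rows = age.count()    # count() skips NaN
missing_rows = total_rows - non_null_rows

print(total_rows, non_null_rows, missing_rows)
```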
Age Column
The Age column has a skewed distribution, so we can use the
median to impute the missing data.
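A sketch of median imputation on a made-up, right-skewed sample follows. The median is less affected by extreme values than the mean, which is why it suits a skewed column.

```python
import pandas as pd

# Made-up ages with one large value and two missing entries.
df = pd.DataFrame(
    {"Age": [18.0, 20.0, 22.0, 24.0, 25.0, None, 28.0, 30.0, 71.0, None]}
)

# skew() > 0 indicates a longer right tail; NaN values are ignored.
skewness = df["Age"].skew()

# median() also ignores NaN values.
median_age = df["Age"].median()

# Fill the missing ages with the median.
df["Age"] = df["Age"].fillna(median_age)
print(skewness, median_age)
```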
Age Column
Visualization of the Age column data