CSC407 - Chapter 2-3
(csc 407)
Chapters 2 & 3
Taiwo Kolajo (PhD)
Department of Computer Science
Federal University Lokoja
+2348031805049
2: Data Preprocessing
• Importance of Data Quality
• Data is a crucial component in the field of Machine Learning.
• It refers to the set of observations or measurements that can be
used to train a machine-learning model.
• The quality and quantity of data available for training and testing
play a significant role in determining the performance of a
machine-learning model.
• Data is the most important part of Data Analytics, Machine
Learning, and Artificial Intelligence. Without data, we cannot train
any model, and modern research and automation would be in vain.
• For decision-making, the integrity of the conclusions drawn heavily
relies on the cleanliness of the underlying data.
• Without proper data cleaning, inaccuracies, outliers, missing
values, and inconsistencies can compromise the validity of
analytical results.
• Moreover, clean data facilitates more effective modelling and
pattern recognition, as algorithms perform optimally when fed
clean, consistent input.
2: Data Preprocessing
• Importance of Data Quality Cont’d
• Additionally, clean datasets enhance the
interpretability of findings, aiding in the
formulation of actionable insights.
• Big enterprises spend a lot of money just to
gather as much data as possible.
Example: Why did Facebook acquire WhatsApp by
paying a huge price of $19 billion?
• The answer is simple and logical: to gain
access to user information that Facebook
may not have but WhatsApp does. This
information about users is of paramount
importance to Facebook because it helps
them improve their services.
2: Data Preprocessing
• Different Forms of Data
• Numeric Data: If a feature represents a characteristic
measured in numbers, it is called a numeric feature.
• Categorical Data: A categorical feature is an
attribute that can take on one of a limited, and
usually fixed, number of possible values on the basis
of some qualitative property. A categorical feature is
also called a nominal feature. Examples: gender,
colour, race.
• Ordinal Data: This denotes a nominal variable with
categories falling in an ordered list. Examples include
clothing sizes such as small, medium, and large, or a
measurement of customer satisfaction on a scale from
“not at all happy” to “very happy”.
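A small illustrative sketch in pandas of the three forms of data; the column names and values are made up for illustration only:
import pandas as pd

# Hypothetical data illustrating numeric, nominal and ordinal features
df = pd.DataFrame({
    'age': [25, 32, 47],                  # numeric feature
    'colour': ['red', 'blue', 'red'],     # categorical (nominal) feature
    'size': ['small', 'large', 'medium']  # ordinal feature
})

# Encode the ordinal feature with an explicit category order
df['size'] = pd.Categorical(df['size'],
                            categories=['small', 'medium', 'large'],
                            ordered=True)
print(df.dtypes)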
2: Data Preprocessing
• Advantages of data processing in Machine Learning:
• Improved model performance: Data processing helps improve the
performance of the ML model by cleaning and transforming the data
into a format that is suitable for modeling.
• Better representation of the data: Data processing allows the data to be
transformed into a format that better represents the underlying
relationships and patterns in the data, making it easier for the ML model
to learn from the data.
• Increased accuracy: Data processing helps ensure that the data is
accurate, consistent, and free of errors, which can help improve the
accuracy of the ML model.
• Disadvantages of data processing in Machine Learning:
• Time-consuming: Data processing can be a time-consuming task,
especially for large and complex datasets.
• Error-prone: Data processing can be error-prone, as it involves
transforming and cleaning the data, which can result in the loss of
important information or the introduction of new errors.
• Limited understanding of the data: Data processing can lead to a limited
understanding of the data, as the transformed data may not be
representative of the underlying relationships and patterns in the data.
2: Data Preprocessing
• Steps to Perform Data Cleaning
• Performing data cleaning involves a
systematic process to identify and
rectify errors, inconsistencies, and
inaccuracies in a dataset.
2: Data Preprocessing
• The following are essential steps to perform data
cleaning:
• Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing data
entries for duplicate records, irrelevant information, or data points that do not
contribute meaningfully to the analysis. Removing unwanted observations
streamlines the dataset, reducing noise and improving the overall quality.
• Fixing Structural Errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity in data
representation. Fixing structural errors enhances data consistency and facilitates
accurate analysis and interpretation.
• Managing Unwanted outliers: Identify and manage outliers, which are data
points significantly deviating from the norm. Depending on the context, decide
whether to remove outliers or transform them to minimize their impact on
analysis. Managing outliers is crucial for obtaining more accurate and reliable
insights from the data.
• Handling Missing Data: Devise strategies to handle missing data effectively.
This may involve imputing missing values based on statistical methods,
removing records with missing values, or employing advanced imputation
techniques. Handling missing data ensures a more complete dataset, preventing
biases and maintaining the integrity of analyses.
2: Data Preprocessing
• Python Implementation for Data Cleaning
• Let’s understand each step of data cleaning
using the Titanic dataset.
• Below are the necessary steps (a code sketch follows):
• Import the necessary libraries
• Load the dataset
• Check the data information using df.info()
• Check for duplicate rows using df.duplicated()
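A minimal, hedged sketch of these setup steps; the file name 'titanic.csv' is an assumption about where the dataset is stored locally. The boolean Series shown below is the kind of output df.duplicated() produces when checking for duplicate rows:
# Import the necessary libraries
import pandas as pd
import numpy as np

# Load the Titanic dataset (file name is an assumption)
df = pd.read_csv('titanic.csv')

# Check the data information
df.info()

# Check for duplicate rows (produces the boolean Series shown below)
df.duplicated()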
Output:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
From df.info() we can see that Age and Cabin have fewer non-null
counts than the other columns. Some of the columns are categorical
with data type object, and some are numerical (int64/float64).
2: Data Preprocessing
• Check the Categorical and Numerical Columns.
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)
Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Number of unique values in each categorical column (e.g. via df[cat_col].nunique()):
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
2: Data Preprocessing
• Removal of all Above Unwanted
Observations
• This includes deleting duplicate, redundant, or irrelevant
values from your dataset. Duplicate observations most
frequently arise during data collection, while irrelevant
observations are those that do not actually fit the specific
problem you are trying to solve.
• Redundant observations reduce efficiency to a great extent
because the repeated data can push results towards either
the correct or the incorrect side, producing unreliable
results.
• Irrelevant observations are any type of data that is of no use
to us and can be removed directly.
• Here we drop the Name column because names are always
unique and have little influence on the target variable. For the
Ticket column, let’s first print the first 50 unique tickets.
df['Ticket'].unique()[:50]
2: Data Preprocessing
Output: array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450', ...]
df1 = df.drop(columns=['Name','Ticket'])
df1.shape
Output:
(891, 10)
2: Data Preprocessing
• Handling Missing Data
• Missing data is a common issue in real-world datasets, and it can occur due to various reasons
such as human errors, system failures, or data collection issues. Various techniques can be
used to handle missing data, such as imputation, deletion, or substitution.
• Let’s check the percentage of missing values in each column using df.isnull()
• it checks whether each value is null and returns boolean values.
• .sum() then counts the null values in each column; dividing by the total number of rows
(df1.shape[0]) and multiplying by 100 gives the percentage of missing values per
column.
round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
We cannot simply ignore or remove the missing observations; they must be handled carefully, as they
can bias the results and reduce the quality of the model.
2: Data Preprocessing
• Handling Missing Data
• The two most common ways to deal with missing data are:
• Dropping Observations with missing values.
• Imputing the missing values from past observations.
• It is not a good idea to impute a column with 77% null values, so we
drop the Cabin column. The Embarked column has only 0.22% null
values, so we drop the rows where Embarked is null.
df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape
Output:
(889, 9)
2: Data Preprocessing
• Handling Missing Data
• Imputing the missing values from past observations.
• Again, “missingness” is almost always informative in itself, and you should tell your
algorithm if a value was missing.
• Even if you build a model to impute your values, you’re not adding any real information.
You’re just reinforcing the patterns already provided by other features.
• We can use Mean imputation or Median imputations for the case.
• Note:
• Mean imputation is suitable when the data is normally distributed and has no extreme outliers.
• Median imputation is preferable when the data contains outliers or is skewed.
# Mean imputation for the Age column (only Age still has missing values)
df3 = df2.copy()
df3['Age'] = df3['Age'].fillna(df3['Age'].mean())
# Let's check the null values again
df3.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
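If Age were skewed or contained extreme outliers, a median-based variant could be used instead; a short sketch (not part of the original slide code):
# Median imputation for the Age column
df3_median = df2.copy()
df3_median['Age'] = df3_median['Age'].fillna(df3_median['Age'].median())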
2: Data Preprocessing
• Handling Outliers
• Outliers are extreme values that deviate significantly from the majority of
the data.
• They can negatively impact the analysis and model performance.
• Techniques such as clustering, interpolation, or transformation can be used
to handle outliers.
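A hedged sketch that is consistent with the bounds printed below; it assumes a two-standard-deviation rule on the Age column of df3 from the previous step (both the rule and the column are assumptions):
# Compute mean and standard deviation of Age
mean_age = df3['Age'].mean()
std_age = df3['Age'].std()

# Keep values within two standard deviations of the mean
lower_bound = mean_age - 2 * std_age
upper_bound = mean_age + 2 * std_age
print('Lower Bound :', lower_bound)
print('Upper Bound :', upper_bound)

# Drop rows whose Age falls outside the bounds
df4 = df3[(df3['Age'] >= lower_bound) & (df3['Age'] <= upper_bound)]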
Output:
Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785
3: Exploratory Data Analysis (EDA)
• info() facilitates comprehension of each column’s data type,
the number of non-null records in each column, and the
dataset’s memory usage.
#data information
df.info()
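These EDA examples use the red-wine quality dataset (1599 samples, 12 columns). A hedged setup sketch; the file name 'winequality-red.csv' is an assumption:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the red-wine quality dataset (file name is an assumption)
df = pd.read_csv('winequality-red.csv')

# 1599 rows and 12 columns
print(df.shape)

# Data types, non-null counts and memory usage
df.info()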
3: Exploratory Data Analysis (EDA)
# describing the data
df.describe()
#column to list
df.columns.tolist()
Here, a count plot shows the number of wine samples at each quality rating.
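A hedged sketch of the count plot described above, assuming the target column is named 'quality' as in the UCI wine-quality data:
import seaborn as sns
import matplotlib.pyplot as plt

# Count of wine samples per quality rating
sns.countplot(x='quality', data=df)
plt.title('Count of Wine Samples by Quality')
plt.show()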
3: Exploratory Data Analysis (EDA)
• Kernel density plot
# Set Seaborn style
sns.set_style("darkgrid")
Skewness is depicted by observing whether the density curve is
symmetric or has a longer tail on one side.
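A hedged sketch of a kernel density plot for one numeric feature; the column name 'alcohol' is an assumption:
import seaborn as sns
import matplotlib.pyplot as plt

# Set Seaborn style
sns.set_style("darkgrid")

# Kernel density estimate of a single numeric feature
sns.kdeplot(data=df, x='alcohol', fill=True)
plt.title('Kernel Density Plot of Alcohol')
plt.show()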
3: Exploratory Data Analysis (EDA)
• Bivariate Analysis
• Violin plot
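A hedged sketch of a bivariate violin plot; the column names 'quality' and 'alcohol' are assumptions:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of alcohol content for each quality rating
sns.violinplot(x='quality', y='alcohol', data=df)
plt.title('Alcohol Content by Wine Quality')
plt.show()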
The code fragment below belongs to the correlation heatmap that is interpreted on the next slide; the missing heatmap call is reconstructed here:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
3: Exploratory Data Analysis (EDA)
• Multivariate Analysis
Interpreting the correlation matrix plot
• Values close to +1 indicate a strong positive
correlation, values close to -1 indicate a strong
negative correlation, and values near 0 suggest
no linear correlation.
• Darker colors signify stronger correlations,
while lighter colors represent weaker
correlations.
• Positively correlated variables move in the
same direction: as one increases, the other
tends to increase as well.
3: Exploratory Data Analysis (EDA)
• In summary, the Python-based exploratory data analysis (EDA)
of the wine dataset has yielded important new information about
the properties of the wine samples.
• We investigated correlations between variables, identified
outliers, and obtained a knowledge of the distribution of
important features using statistical summaries and
visualizations.
• The quantitative and qualitative features of the dataset were
analyzed in detail through the use of various plots, including
pair, box, and histogram plots. Finding patterns, trends, and
possible topics for more research was made easier by this EDA
method.
• Furthermore, the analysis demonstrated the ability to visualize
and analyze complicated datasets using Python tools such as
Matplotlib, Seaborn, and Pandas.
• The results provide a thorough grasp of the wine dataset and lay
the groundwork for further analysis and modelling.