CS202 Assignment - 4 - GIKI
(GIKI)
Assignment # 4
Subject: ICT
Course Code: CS202
Class: BS cys-3rd
Submission Deadline: 21/Dec/2024
CHEATING, COPYING, or LATE SUBMISSION (even 1 minute late) will be graded as STRAIGHT ZERO MARKS.
Submit on time; no excuses will be accepted.
(Read Carefully)
This assignment aims to help students understand the fundamental steps in dataset preprocessing and data
visualization. Students will select a dataset, clean and preprocess the data, explore it using visualizations, and
provide insights. This will help develop skills in data manipulation, cleaning, and visualization — essential tasks in
data analysis and machine learning.
Dataset Selection:
Choose any dataset that has both numerical and categorical variables (e.g., from Kaggle, UCI, or any public
dataset repository).
Note: Ensure the dataset has at least 3 features and a target variable for classification or regression tasks.
Steps to Follow:
1. Data Loading and Cleaning
Load the Dataset: Import the dataset into R using read.csv() or readr::read_csv().
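As a minimal sketch of the loading step: the built-in iris data is written to a temporary CSV file only so that the example runs anywhere. The path csv_path is a placeholder; substitute the path to your own dataset.

```r
# Placeholder file: iris is saved to a temp CSV purely to make the sketch runnable.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Load the CSV; text columns stay as character until they are encoded later.
df <- read.csv(csv_path, stringsAsFactors = FALSE)

str(df)          # column types and a preview of the values
print(head(df))  # first six rows
print(dim(df))   # number of rows and columns
```

Inspecting str() and dim() right after loading is a quick way to confirm that the dataset has the numerical and categorical columns you expect.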
Data Cleaning:
● Missing Values: Identify and handle missing data by either removing rows/columns or imputing missing
values (e.g., using the mean or median for numerical data).
● Duplicate Rows: Check for and remove any duplicate rows using duplicated().
● Categorical Encoding: Convert categorical variables into factors, if necessary.
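The three cleaning steps above could be sketched as follows. The data frame and its column names (age, income, city) are made up for illustration, and median imputation is only one of the options mentioned; removing rows may suit your dataset better.

```r
# Toy data (hypothetical) with two missing values and one duplicate row.
df <- data.frame(
  age    = c(25, NA, 40, 40, 31),
  income = c(50000, 62000, 58000, 58000, NA),
  city   = c("Lahore", "Topi", "Karachi", "Karachi", "Topi"),
  stringsAsFactors = FALSE
)

# 1. Missing values: count them per column, then impute numeric columns
#    with the column median.
print(colSums(is.na(df)))
num_cols <- names(df)[sapply(df, is.numeric)]
for (col in num_cols) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}

# 2. Duplicate rows: count them, then keep only the first occurrence.
print(sum(duplicated(df)))
df <- df[!duplicated(df), ]

# 3. Categorical encoding: convert the text column to a factor.
df$city <- as.factor(df$city)

str(df)
```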
2. Exploratory Data Analysis (EDA)
Summary Statistics: Display summary statistics of the numerical features (e.g., mean, median, standard
deviation, etc.).
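One possible way to produce these statistics in base R, using the built-in iris data as a stand-in for your dataset (the names num_df and stats are illustrative):

```r
df <- iris  # stand-in dataset; replace with your own data frame

# Keep only the numerical columns.
num_df <- df[sapply(df, is.numeric)]

# Built-in overview: min, quartiles, median, mean, max per column.
print(summary(num_df))

# summary() omits the standard deviation, so build a small table by hand.
stats <- data.frame(
  mean   = sapply(num_df, mean,   na.rm = TRUE),
  median = sapply(num_df, median, na.rm = TRUE),
  sd     = sapply(num_df, sd,     na.rm = TRUE)
)
print(round(stats, 2))
```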
Target Variable Analysis: Examine the distribution of the target variable.
● If the target variable is categorical (e.g., yes/no, class1/class2), visualize the count distribution.
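For a categorical target, the count distribution could be examined roughly like this. The sketch assumes iris with Species as the target; your target column will have a different name.

```r
df <- iris  # stand-in dataset; Species plays the role of the target

# Count and proportion of each class.
target_counts <- table(df$Species)
print(target_counts)
print(round(prop.table(target_counts), 3))

# Bar plot of the class counts.
barplot(target_counts,
        main = "Target class distribution",
        xlab = "Class", ylab = "Count")
```

Strongly unequal bar heights would indicate class imbalance, which is worth noting in the report.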
Feature Distribution:
● For numerical features, plot histograms or box plots to understand their distribution.
● For categorical features, use bar plots to show the frequency of each category.
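A minimal base R sketch of these plots, assuming the built-in mtcars data with mpg as a numerical feature and cyl treated as a categorical one (both choices are illustrative):

```r
df <- mtcars               # stand-in dataset
df$cyl <- factor(df$cyl)   # treat the cylinder count as categorical

# Put the three plots side by side; op stores the previous settings.
op <- par(mfrow = c(1, 3))

# Numerical feature: histogram and box plot.
hist(df$mpg, main = "Histogram of mpg", xlab = "mpg")
boxplot(df$mpg, main = "Box plot of mpg", ylab = "mpg")

# Categorical feature: bar plot of category frequencies.
cyl_counts <- table(df$cyl)
barplot(cyl_counts, main = "Cars per cylinder count",
        xlab = "Cylinders", ylab = "Frequency")

par(op)  # restore the previous plotting layout
```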
3. Data Visualization
Correlation Matrix: For numerical features, plot a correlation matrix to understand relationships between
variables.
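As one way to do this without extra packages, the sketch below computes Pearson correlations with cor() and draws them with base R's heatmap(). The iris data is a stand-in, and packages such as corrplot or ggplot2 are equally acceptable alternatives.

```r
# Numerical columns of a stand-in dataset.
num_df <- iris[sapply(iris, is.numeric)]

# Pairwise Pearson correlations, ignoring rows with missing values.
cor_mat <- cor(num_df, use = "complete.obs")
print(round(cor_mat, 2))

# Heat map of the matrix; clustering and rescaling are switched off
# so the cells show the raw correlations in the original column order.
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        margins = c(8, 8), main = "Correlation matrix")
```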
Boxplots for Target vs. Feature: For numerical features, create boxplots to compare the distribution of the
feature across different classes of the target variable.
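A short sketch of such a comparison, assuming iris with Petal.Length as the feature and Species as the target (illustrative choices):

```r
# One box per target class, using the formula interface: feature ~ target.
boxplot(Petal.Length ~ Species, data = iris,
        main = "Petal.Length by Species",
        xlab = "Species", ylab = "Petal.Length")

# Numeric companion to the plot: median of the feature within each class.
class_medians <- aggregate(Petal.Length ~ Species, data = iris, FUN = median)
print(class_medians)
```

Boxes that barely overlap suggest the feature separates the classes well, which is the kind of insight the report should mention.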
4. Feature Engineering
Feature Scaling: If your data contains features with very different scales, consider normalizing or standardizing
the features (especially important for models like k-NN).
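The two scaling options could look like this in base R. The data frame and the helper min_max() are hypothetical names introduced for the sketch.

```r
# Toy data (hypothetical) with features on very different scales.
df <- data.frame(
  age    = c(20, 30, 40, 50),
  income = c(20000, 40000, 60000, 100000)
)

# Standardization (z-score): each column gets mean 0 and standard deviation 1.
standardized <- as.data.frame(scale(df))

# Normalization (min-max): each column is rescaled to the range [0, 1].
# Note: this divides by zero if a column is constant.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
normalized <- as.data.frame(lapply(df, min_max))

print(round(standardized, 3))
print(normalized)
```

If the data is later split into training and test sets, a common practice is to compute the scaling parameters on the training set only and reuse them for the test set.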
Create New Features: Based on your understanding of the dataset, you may try creating new features that could
help improve model performance (e.g., ratios, combinations of existing features, etc.).
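For illustration only, here is a ratio and a product built from existing iris columns. The new column names Sepal.Ratio and Petal.Area are invented for this sketch; whether such features help depends on your dataset.

```r
df <- iris  # stand-in dataset

# Ratio of two existing features.
df$Sepal.Ratio <- df$Sepal.Length / df$Sepal.Width

# Combination (product) of two existing features.
df$Petal.Area <- df$Petal.Length * df$Petal.Width

print(head(df[, c("Sepal.Ratio", "Petal.Area", "Species")]))
```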
5. Model Preparation
Split Data into Train and Test Sets: Split the data into training and testing sets (usually an 80/20 or 70/30 split).
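A minimal sketch of an 80/20 random split in base R, again using iris as a stand-in. The seed value 42 is arbitrary; it only makes the split reproducible.

```r
set.seed(42)  # arbitrary seed so the same split is produced on every run

df <- iris    # stand-in dataset
n  <- nrow(df)

# Randomly pick 80% of the row indices for training.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- df[train_idx, ]   # 80% of the rows
test  <- df[-train_idx, ]  # the remaining 20%

print(dim(train))
print(dim(test))
```

For a 70/30 split, change 0.8 to 0.7. With an imbalanced categorical target, a stratified split (sampling within each class) may be preferable to this simple random one.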
6. Report
Write a brief report that explains your findings and insights from the EDA.
● Dataset Description: Provide the name of the dataset, source, and a brief overview of the features and
target variable.
● Key Findings from EDA:
○ Summary statistics of the numerical features.
○ Distributions of categorical and numerical features.
○ Insights drawn from the correlation matrix or visualizations.
● Data Cleaning: Describe how you handled missing values and duplicates, if any.
● Visualizations: Present the most important visualizations that highlight the key patterns in the data.
● Next Steps: Suggest possible further analyses, improvements, or models that could be applied.
Deliverables:
Visualizations: Include relevant visualizations that help explain the data (e.g., plots showing distributions,
relationships between features, etc.).
● Dataset description.
● Key findings from the EDA.
● Insights from the visualizations.
● Any additional steps for further analysis.
Grading Criteria:
● Code Quality (30%): The code should be clear, efficient, and well-commented.
● Correctness (40%): Proper handling of missing values, duplicates, and correct usage of visualization
techniques.
● Analysis (20%): A thorough and insightful EDA that includes meaningful visualizations and
interpretations.
● Creativity (10%): Innovative approaches to data exploration, feature creation, or visualization.