
CS202 Assignment - 4 - GIKI

The assignment for Ghulam Ishaq Khan Institute (GIKI) focuses on dataset preprocessing and visualization using R, with a submission deadline of December 21, 2024. Students are required to select a dataset, perform data cleaning and exploratory data analysis, and create visualizations to derive insights. Deliverables include R code, relevant visualizations, and a brief report summarizing the dataset and findings, with grading criteria based on code quality, correctness, analysis, and creativity.

Ghulam Ishaq Khan Institute (GIKI)

Assignment # 4

Subject: ICT
Course Code: CS202
Class: BS cys-3rd
Submission Deadline: 21/Dec/2024
Instructor: M Talha Ashfaq
Total Marks: 100

Note (read the notes and instructions first):

Any CHEATING/COPYING case or LATE SUBMISSION, even 1 minute late, will be graded STRAIGHT ZERO MARKS.
Be on time; no excuses will be accepted.

(Read Carefully)

Dataset Preprocessing and Visualization in R


Objective:

This assignment aims to help students understand the fundamental steps in dataset preprocessing and data
visualization. Students will select a dataset, clean and preprocess the data, explore it using visualizations, and
provide insights. This will help develop skills in data manipulation, cleaning, and visualization, which are
essential tasks in data analysis and machine learning.

Dataset Selection:

Choose any dataset that has both numerical and categorical variables (e.g., from Kaggle, UCI, or any public
dataset repository).

You may consider datasets such as:

● Iris dataset (classification of iris species based on flower features).
● Titanic dataset (predict survival based on passenger features).
● Wine dataset (classification of wine quality).
● Breast Cancer dataset (diagnosis of cancerous or non-cancerous samples).

Note: Ensure the dataset has at least 3 features and a target variable for classification or regression tasks.

Steps to Follow:

1. Data Importing and Preprocessing

Load the Dataset: Import the dataset into R using read.csv() or readr::read_csv().
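
As a minimal sketch of the loading step, the example below writes R's built-in iris data to a temporary CSV file and reads it back with read.csv(). The temporary file and the name csv_path are assumptions made only so the example runs on its own; in an actual submission csv_path would point to the downloaded dataset file.

```r
# Hypothetical setup: create a small CSV so the example is self-contained.
# In the assignment, csv_path would be the path to your own dataset file.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Import with base R; stringsAsFactors = FALSE keeps text columns as character
df <- read.csv(csv_path, stringsAsFactors = FALSE)

# First look at the data: dimensions, column types, and the first rows
dim(df)
str(df)
head(df)
```

If the readr package is installed, readr::read_csv(csv_path) could be used in the same way; it returns a tibble rather than a plain data frame.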

Data Cleaning:

● Missing Values: Identify and handle missing data by either removing rows/columns or imputing missing
values (e.g., using the mean or median for numerical data).
● Duplicate Rows: Check for and remove any duplicate rows using duplicated().
● Categorical Encoding: Convert categorical variables into factors, if necessary.
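
The three cleaning steps above can be sketched as follows. The data frame and its column names (age, fare, class) are invented purely for illustration, and median imputation is only one of the options mentioned; removing incomplete rows with na.omit() would be another.

```r
# Hypothetical toy data: one missing value, one duplicated row,
# and a text column that should become a factor
df <- data.frame(
  age   = c(22, 35, NA, 35, 41),
  fare  = c(7.25, 53.10, 8.05, 53.10, 13.00),
  class = c("third", "first", "third", "first", "second"),
  stringsAsFactors = FALSE
)

# 1. Missing values: count NAs per column, then impute with the column median
colSums(is.na(df))
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

# 2. Duplicate rows: duplicated() flags rows that repeat an earlier row
sum(duplicated(df))
df <- df[!duplicated(df), ]

# 3. Categorical encoding: convert the text column to a factor
df$class <- factor(df$class)

str(df)
```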

2. Exploratory Data Analysis (EDA)

Summary Statistics: Display summary statistics of the numerical features (e.g., mean, median, standard
deviation, etc.).
Target Variable Analysis: Examine the distribution of the target variable.

● If the target variable is categorical (e.g., yes/no, class1/class2), visualize the count distribution.
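
A possible way to produce the summary statistics and the target distribution, shown here on the built-in iris data with Species assumed to be the target variable. The variable names num_cols and target_counts are illustrative only.

```r
data(iris)

# Select the numerical columns and summarise them
num_cols <- sapply(iris, is.numeric)
summary(iris[, num_cols])

# summary() does not report the standard deviation, so compute it separately
sapply(iris[, num_cols], sd)

# Target variable analysis: count of observations per class
target_counts <- table(iris$Species)
target_counts

# Visualise the count distribution of the categorical target
barplot(target_counts,
        main = "Observations per species",
        xlab = "Species", ylab = "Count")
```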

Feature Distribution:

● For numerical features, plot histograms or box plots to understand their distribution.
● For categorical features, use bar plots to show the frequency of each category.
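
The sketch below illustrates both cases with base R graphics on the built-in mtcars data. Treating the cylinder count (cyl) as a categorical feature is an assumption made for this example.

```r
data(mtcars)
cars <- mtcars
cars$cyl <- factor(cars$cyl)  # treat the cylinder count as a category

# Numerical feature: histogram and box plot of miles per gallon
h <- hist(cars$mpg,
          main = "Distribution of mpg", xlab = "Miles per gallon")
boxplot(cars$mpg, main = "Box plot of mpg", ylab = "Miles per gallon")

# Categorical feature: bar plot of the frequency of each category
cyl_counts <- table(cars$cyl)
barplot(cyl_counts,
        main = "Cars per cylinder count",
        xlab = "Cylinders", ylab = "Frequency")
```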

3. Data Visualization

Correlation Matrix: For numerical features, plot a correlation matrix to understand relationships between
variables.
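
One way to compute and plot a correlation matrix using only base R is sketched below on the iris data. The use of heatmap() is a choice made for this example; add-on packages such as corrplot are another option if they are installed.

```r
data(iris)

# Keep only the numerical features and compute pairwise Pearson correlations
num <- iris[, sapply(iris, is.numeric)]
corr <- cor(num)
round(corr, 2)

# Draw the matrix as a heat map; Rowv/Colv = NA turns off reordering and
# dendrograms, and scale = "none" plots the correlations as they are
heatmap(corr, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        margins = c(8, 8), main = "Correlation matrix")
```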

Scatter Plots: Visualize relationships between pairs of numerical features.
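
As an illustration, the following sketch draws a single scatter plot and a scatter plot matrix for the iris data. Colouring the points by Species is an assumption added for readability, not a requirement of the task.

```r
data(iris)

# Integer codes of the species, used as plotting colours
species_cols <- as.integer(iris$Species)

# Single pair of numerical features
plot(iris$Petal.Length, iris$Petal.Width,
     col = species_cols, pch = 19,
     xlab = "Petal length", ylab = "Petal width",
     main = "Petal length vs. petal width")
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)

# All pairs of numerical features at once
pairs(iris[, 1:4], col = species_cols, pch = 19,
      main = "Scatter plot matrix")
```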

Boxplots for Target vs. Feature: For numerical features, create boxplots to compare the distribution of the
feature across different classes of the target variable.
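
A short sketch of this comparison, assuming Species is the target and Petal.Length is the feature of interest. The formula interface feature ~ target creates one box per class.

```r
data(iris)

# One box per class of the target variable
bp <- boxplot(Petal.Length ~ Species, data = iris,
              main = "Petal length by species",
              xlab = "Species", ylab = "Petal length")

# The third row of bp$stats holds the median of each group
bp$stats[3, ]
```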

4. Feature Engineering

Feature Scaling: If your data contains features with very different scales, consider normalizing or standardizing
the features (especially important for models like k-NN).
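
Both options can be sketched in base R as shown below. The selected mtcars columns and the helper min_max() are assumptions made for this example.

```r
data(mtcars)
num <- mtcars[, c("mpg", "hp", "wt")]  # features on very different scales

# Standardization: subtract the mean and divide by the standard deviation
standardized <- as.data.frame(scale(num))

# Normalization: rescale each feature to the range [0, 1]
# (min_max is a hypothetical helper defined here, not a built-in function)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
normalized <- as.data.frame(lapply(num, min_max))

summary(standardized)
summary(normalized)
```

When a model is trained afterwards, the scaling parameters are usually computed on the training set only and then applied to the test set.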

Create New Features: Based on your understanding of the dataset, you may try creating new features that could
help improve model performance (e.g., ratios, combinations of existing features, etc.).
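
For instance, a ratio and a product of existing columns could be added as follows. The feature names Petal.Area and Sepal.Ratio are invented for this sketch, and whether such features actually help depends on the dataset and model.

```r
data(iris)
iris_fe <- iris

# Combination of existing features: approximate petal area
iris_fe$Petal.Area <- iris_fe$Petal.Length * iris_fe$Petal.Width

# Ratio of existing features: sepal length relative to sepal width
iris_fe$Sepal.Ratio <- iris_fe$Sepal.Length / iris_fe$Sepal.Width

head(iris_fe)
```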

5. Model Preparation

Split Data into Train and Test Sets: Split the data into training and testing sets (usually an 80/20 or 70/30 split).
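
A minimal sketch of an 80/20 split using base R's sample(). The seed value is arbitrary and is set only so the split is reproducible.

```r
data(iris)
set.seed(42)  # arbitrary seed, for reproducibility only

# Randomly pick 80% of the row indices for training
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]  # the remaining 20%

nrow(train)
nrow(test)
```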

6. Documentation and Reporting

Write a brief report that explains your findings and insights from the EDA.

In your report, include the following:

● Dataset Description: Provide the name of the dataset, source, and a brief overview of the features and
target variable.
● Key Findings from EDA:
○ Summary statistics of the numerical features.
○ Distributions of categorical and numerical features.
○ Insights drawn from the correlation matrix or visualizations.
● Data Cleaning: Describe how you handled missing values and duplicates, if any.
● Visualizations: Present the most important visualizations that highlight the key patterns in the data.
● Next Steps: Suggest possible further analyses, improvements, or models that could be applied.

Deliverables:

Code: Submit your R script or R Markdown file that includes:

● Data import and preprocessing steps.
● Visualizations (e.g., histograms, box plots, scatter plots).
● Feature engineering (if done).
● Any necessary comments to explain your code.

Visualizations: Include relevant visualizations that help explain the data (e.g., plots showing distributions,
relationships between features, etc.).

Report: Submit a brief report (max 2 pages) with:

● Dataset description.
● Key findings from the EDA.
● Insights from the visualizations.
● Any additional steps for further analysis.

Grading Criteria:

● Code Quality (30%): The code should be clear, efficient, and well-commented.
● Correctness (40%): Proper handling of missing values, duplicates, and correct usage of visualization
techniques.
● Analysis (20%): A thorough and insightful EDA that includes meaningful visualizations and
interpretations.
● Creativity (10%): Innovative approaches to data exploration, feature creation, or visualization.
