Fall 2024 - Project - CEP
The course project for the Data Pre-processing with Python course has been designed as a Complex Engineering
Problem (CEP).
Project Title:
Design and Optimization of a Data Preprocessing Pipeline for Machine Learning Applications
Project Statement:
In this project, students will design, implement, and evaluate a comprehensive data preprocessing pipeline to prepare
a dataset for machine learning applications. This project addresses the challenges of handling missing values,
removing outliers, and optimizing data transformation techniques to ensure robust model performance. Students will
work with a real-world dataset, applying theoretical knowledge and practical skills to design innovative
preprocessing solutions that balance conflicting requirements such as computational efficiency, data integrity, and
model accuracy.
Objectives:
1. Develop a deep understanding of advanced data preprocessing techniques and their role in machine learning.
2. Equip students with hands-on experience in handling real-world dataset challenges.
3. Foster innovative thinking to balance trade-offs in preprocessing strategies.
4. Enhance problem-solving skills through iterative design, implementation, and evaluation.
Project Phases:
Phase 1: Data Exploration and Problem Framing (Relevant WP: WP2 - Range of Conflicting Requirements)
• Select a real-world dataset from platforms like Kaggle or UCI Machine Learning Repository.
• Identify and document challenges related to missing values, outliers, and feature representation.
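One possible way to document the Phase 1 challenges is a per-column summary table. The sketch below is illustrative only: the helper name summarize_challenges, the toy columns, and the 1.5 x IQR rule used for the outlier count are assumptions, not requirements of the project.

```python
import numpy as np
import pandas as pd


def summarize_challenges(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column report: dtype, missing share, cardinality, IQR outlier count."""
    rows = []
    for col in df.columns:
        s = df[col]
        n_outliers = np.nan  # only defined for numeric columns
        if pd.api.types.is_numeric_dtype(s):
            q1, q3 = s.quantile(0.25), s.quantile(0.75)
            iqr = q3 - q1
            # Comparisons with NaN are False, so missing values are not counted.
            mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
            n_outliers = int(mask.sum())
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "missing_pct": round(100 * s.isna().mean(), 1),
            "n_unique": s.nunique(),  # high cardinality hints at encoding issues
            "iqr_outliers": n_outliers,
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical toy data standing in for a Kaggle/UCI dataset.
toy = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 38, 29, 120],
    "income": [40_000, 52_000, 48_000, np.nan, 61_000, 45_000, 50_000],
    "city": ["A", "B", None, "A", "C", "B", "A"],
})
report = summarize_challenges(toy)
print(report)
```

A table like this can go directly into the problem-framing section of the report, with one row per feature.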
Phase 2: Feature Engineering and Data Transformation (Relevant WP: WP1 - Depth of Knowledge Required,
WP3 - Depth of Analysis Required)
• Engineer features using techniques like one-hot encoding, label encoding, and feature scaling.
• Implement dimensionality reduction techniques such as PCA to address the curse of dimensionality.
• Justify the selection of transformation methods for the dataset.
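The Phase 2 steps can be sketched with scikit-learn as follows. This is a minimal example on synthetic data, not a prescribed solution: the column names, the 95% explained-variance target for PCA, and the choice of StandardScaler are assumptions made for illustration. Note that scikit-learn's LabelEncoder is intended for the target column; OrdinalEncoder is the usual counterpart for input features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Synthetic stand-in for a real dataset (all column names are hypothetical).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "height_cm": rng.normal(170, 10, n),
    "weight_kg": rng.normal(70, 12, n),
    "income": rng.lognormal(10, 0.5, n),
    "segment": rng.choice(["a", "b", "c"], n),
})
y_raw = rng.choice(["no", "yes"], n)

# Label encoding: map the string target to integers 0..k-1.
y = LabelEncoder().fit_transform(y_raw)

numeric_cols = ["height_cm", "weight_kg", "income"]
categorical_cols = ["segment"]

# Scale numeric columns and one-hot encode categorical ones in a single step.
encode_and_scale = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array so PCA can consume it
)

# Keep the smallest number of components explaining at least 95% of variance.
transform = Pipeline([
    ("encode_scale", encode_and_scale),
    ("pca", PCA(n_components=0.95)),
])

X_reduced = transform.fit_transform(X)
explained = transform.named_steps["pca"].explained_variance_ratio_
print(X_reduced.shape, explained.round(3))
```

Reporting the explained-variance ratios alongside the chosen number of components is one way to support the justification asked for in the last bullet.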
Phase 3: Handling Missing and Noisy Data (Relevant WP: WP2 - Range of Conflicting Requirements)
• Apply more than one imputation technique (e.g., KNN imputation, iterative imputation) to handle missing data.
• Identify and remove outliers using Z-score and IQR methods.
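The two Phase 3 bullets can be sketched as below, using scikit-learn's KNNImputer and IterativeImputer followed by Z-score and IQR filters written in NumPy. The synthetic data, the thresholds (|z| > 3 and 1.5 x IQR), and the decision to impute before filtering are assumptions for illustration; the order of the two steps is itself a design choice worth discussing in the report.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Synthetic data: three correlated features, planted outliers, 10% missing.
rng = np.random.default_rng(42)
X_true = rng.normal(loc=50, scale=5, size=(300, 3))
X_true[:, 2] = 0.5 * X_true[:, 0] + 0.5 * X_true[:, 1] + rng.normal(0, 1, 300)

X = X_true.copy()
X[:5, 0] = 500.0                      # gross outliers in the first five rows
missing = rng.random(X.shape) < 0.1
missing[:5, 0] = False                # keep the planted outliers observed
X[missing] = np.nan


def zscore_mask(data, threshold=3.0):
    """True for rows where any feature lies more than `threshold` std devs from its mean."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    return (np.abs(z) > threshold).any(axis=1)


def iqr_mask(data, k=1.5):
    """True for rows where any feature falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = q3 - q1
    return ((data < q1 - k * iqr) | (data > q3 + k * iqr)).any(axis=1)


imputers = {
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

results = {}
for name, imputer in imputers.items():
    X_imp = imputer.fit_transform(X)
    z_flags = zscore_mask(X_imp)
    iqr_flags = iqr_mask(X_imp)
    results[name] = {
        "imputed": X_imp,
        "z_flags": z_flags,
        "iqr_flags": iqr_flags,
        "cleaned": X_imp[~iqr_flags],  # here rows are dropped by the IQR rule
    }
    print(name, "z-score rows:", int(z_flags.sum()), "IQR rows:", int(iqr_flags.sum()))
```

The Z-score rule depends on the mean and standard deviation, which the outliers themselves inflate, while the IQR rule relies on quartiles and is less sensitive to them. The two methods will therefore often flag different rows.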
Phase 5: Model Performance and Preprocessing Impact Analysis (Relevant WP: WP3 - Depth of Analysis
Required)
• Evaluate the performance of machine learning models trained on preprocessed data.
• Compare results across multiple preprocessing strategies.
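A comparison across preprocessing strategies might look like the following sketch. The synthetic dataset, the three strategies, the logistic regression model, and 5-fold accuracy as the metric are all assumptions chosen for brevity. Placing the preprocessing steps inside a scikit-learn Pipeline means imputers and scalers are fitted on the training folds only, which avoids leaking information from the validation folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic classification data with roughly 10% of values removed at random.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Each strategy is a list of (name, transformer) steps placed before the model.
strategies = {
    "mean + standard": [
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ],
    "median + minmax": [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", MinMaxScaler()),
    ],
    "knn + standard": [
        ("impute", KNNImputer(n_neighbors=5)),
        ("scale", StandardScaler()),
    ],
}

scores = {}
for name, steps in strategies.items():
    pipe = Pipeline(steps + [("model", LogisticRegression(max_iter=1000))])
    cv = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    scores[name] = (cv.mean(), cv.std())
    print(f"{name:16s} accuracy = {cv.mean():.3f} +/- {cv.std():.3f}")
```

Reporting the spread across folds next to the mean helps judge whether a difference between two strategies is meaningful or within noise.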
Evaluation Criteria:
Category                               Weightage (%)    Mapped WPs
Dataset Selection & Problem Framing    20%              WP2
Feature Engineering                    15%              WP1, WP3
Handling Missing and Noisy Data        20%              WP2
Pipeline Design                        20%              WP1, WP3
Final Report & Presentation            25%
=====================================Ended=======================================
GOOD LUCK