ETI Microproject
SIXTH SEMESTER
(Year: 2024-25)
Micro Project
CERTIFICATE
This is to certify that the micro project report entitled
“Data Preparation And Cleaning.”
Submitted by
Sr. No. Name of Student Roll No.
For Sixth Semester of Diploma in Artificial Intelligence & Machine Learning, in the course
Big Data Analytics (22684), for the academic year 2024-25 as per the MSBTE, Mumbai curriculum
of the 'I' scheme.
DIPLOMA OF ENGINEERING
(Artificial Intelligence & Machine Learning)
SUBMITTED TO
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION MUMBAI
ACADEMIC YEAR 2024-25
MICRO PROJECT
Progress Report / Weekly Report
Week No | Date | Duration in Hrs. | Work / Activity Performed | Sign of the Faculty
1 | | 30 min | Knowing the basics |
CO c: Identify common data issues such as missing values, duplicates, and inconsistencies.
CO d: Identify common data issues such as missing values, duplicates, and inconsistencies.
CO e: Demonstrate improved decision-making capabilities through the use of clean and structured data.
Marks:
Roll No. | Name of Student | Marks by Group Work (04) | Marks for individual based on viva (06) | Total Marks (10)
3104 | Akanksha Dhanaji Kadam | | |
Signature:
ACKNOWLEDGEMENT
I am extremely thankful to Principal Dr. P. S. Patil for his motivation and for providing the
infrastructural facilities to work in, without which this work would not have been possible.
I would like to express my gratitude to all my colleagues for their support, co-operation and
fruitful discussions on diverse seminar topics and technical help.
1.0 Rationale
In the realm of data analysis and machine learning, data preparation and cleaning are foundational
steps that significantly influence the accuracy and reliability of results. Raw data collected from
various sources—such as surveys, sensors, logs, or online platforms—often contain inconsistencies,
missing values, duplicates, and errors that must be addressed before meaningful analysis can begin.
Data preparation involves transforming raw data into a structured and usable format. This includes
tasks like data integration, formatting, normalization, and transformation. On the other hand, data
cleaning focuses specifically on identifying and rectifying data quality issues, such as incorrect
entries, outliers, and missing or duplicated values.
The primary aim of this micro-project is to understand and apply the techniques of data preparation
and cleaning in order to enhance the quality of datasets used for analysis or modeling. By working
through real-world or simulated data, the project seeks to demonstrate how effective preprocessing
can improve data usability and accuracy, thus enabling better insights and decision-making.
A review of existing literature highlights the critical role of data preparation and cleaning in data
science workflows. According to Kandel et al. (2011), data analysts spend approximately 80% of
their time preparing data, underscoring its importance. Techniques such as imputation for missing
values, outlier detection, normalization, and data transformation are well-documented in both
academic and industry practices.
Several studies, including Rahm and Do (2000), emphasize the challenges in data integration and
transformation, especially when dealing with heterogeneous sources. Tools like OpenRefine, Python
(with Pandas and NumPy), and R provide comprehensive frameworks for automating and
streamlining the data cleaning process.
5.0 Proposed Methodology
1. Data Collection
Acquire a dataset from a public source (e.g., Kaggle, UCI Machine Learning Repository) or a
custom-generated set.
2. Data Inspection
Explore the dataset to identify quality issues such as missing values, duplicates, outliers, and
inconsistencies.
3. Data Cleaning
o Handle missing values (e.g., removal, mean/median imputation).
o Remove or correct duplicate records.
o Detect and treat outliers.
o Correct inconsistent formatting or erroneous entries.
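The cleaning steps above can be sketched in pandas, the library mentioned earlier. This is a minimal illustration on a small hypothetical dataset (the column names and values are invented for demonstration, not taken from the project's actual data):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with common quality issues:
# a missing value, inconsistent formatting, and a duplicate row.
df = pd.DataFrame({
    "age":  [25, np.nan, 30, 25, 120],
    "city": ["Pune", "Mumbai", "pune ", "Pune", "Mumbai"],
})

# 1. Handle missing values: impute the numerical column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Correct inconsistent formatting before deduplication,
#    so that "pune " and "Pune" are recognized as the same value.
df["city"] = df["city"].str.strip().str.title()

# 3. Remove exact duplicate records.
df = df.drop_duplicates().reset_index(drop=True)
```

Note that formatting is normalized before deduplication; otherwise rows that differ only in whitespace or letter case would not be detected as duplicates.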
4. Data Transformation
o Normalize or scale numerical features.
o Encode categorical variables if needed (e.g., one-hot encoding).
o Format date/time values into consistent formats.
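The transformation steps can likewise be sketched with pandas alone. The dataset and column names here are hypothetical, and Min-Max scaling is written out by hand rather than relying on any particular preprocessing library:

```python
import pandas as pd

# Hypothetical dataset with one numerical, one categorical,
# and one date/time column.
df = pd.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "type":  ["flat", "villa", "flat"],
    "sold":  ["2024-01-05", "2024-02-05", "2024-03-10"],
})

# Normalize the numerical feature to [0, 1] with Min-Max scaling.
p = df["price"]
df["price_scaled"] = (p - p.min()) / (p.max() - p.min())

# One-hot encode the categorical column (creates type_flat, type_villa).
df = pd.get_dummies(df, columns=["type"])

# Parse date strings into a consistent datetime type.
df["sold"] = pd.to_datetime(df["sold"])
```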
5. Data Validation
Evaluate the cleaned dataset to ensure data integrity and suitability for further analysis or modeling.
6. Documentation and Reporting
Record all steps taken, justifying the cleaning decisions and demonstrating before-and-after
comparisons.
6.0 Resources Required
Sr. No. | Name of Resource | Specification | Quantity
3 | Internet | Any | 1
7.0 Action Plan
Sr. No. | Details of Activity | Name of Responsible Team Member
1 | Project Proposal | Gauri Santosh Ambi
2 | Data Collection & Analysis | Akanksha Dhanaji Kadam
3 | Preparation of Prototype / Model | Shreya Anil Yadav
4 | Preparation of Report | Akanksha Dhanaji Kadam
5 | Presentation & Submission | Shreya Anil Yadav
PART B - Micro-Project Proposal
1.0 Rationale:
In today's data-driven world, the availability of raw data is abundant, but its usefulness is often
limited due to issues such as incompleteness, inconsistency, noise, and redundancy. Without proper
preparation and cleaning, raw data can lead to inaccurate analysis, misleading insights, and
unreliable predictive models. Hence, there is a growing need to emphasize the importance of data
preprocessing in any data-centric project.
The significance of data preparation has been emphasized across academic literature and industrial
practices. Kandel et al. (2011) report that up to 80% of a data scientist’s time is spent preparing and
cleaning data. This process includes handling missing data, detecting outliers, and transforming data
into usable formats.
Rahm and Do (2000) highlighted the challenges in data cleaning, particularly in integrating data
from multiple heterogeneous sources. Various tools and libraries have been developed to aid in this
process, including OpenRefine, Python (Pandas, NumPy), and R. Studies further confirm that clean
data significantly improves the performance of machine learning models and data-driven decision
systems.
4.0 Actual Methodology Followed:
1. Dataset Selection:
A real-world dataset was sourced from [e.g., Kaggle – “Housing Prices Dataset”].
2. Initial Data Inspection:
o Checked for missing values, nulls, and NaNs.
o Identified duplicate rows.
o Explored data types and summary statistics.
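The inspection checks listed above map directly onto a few pandas calls. The small dataset below is a hypothetical stand-in for the housing data, included only so the calls have something to run against:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the housing dataset.
df = pd.DataFrame({
    "price": [150000, 200000, np.nan, 200000],
    "rooms": [3, 4, 2, 4],
})

missing_per_column = df.isna().sum()         # NaN/null count per column
duplicate_rows = int(df.duplicated().sum())  # count of exact duplicate rows
summary = df.describe()                      # summary statistics
dtypes = df.dtypes                           # data type of each column
```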
3. Data Cleaning:
o Replaced missing values using mean/median for numerical columns and mode for
categorical columns.
o Removed duplicate entries.
o Handled outliers using the IQR (Interquartile Range) method.
o Standardized inconsistent formats (e.g., date/time).
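The IQR method mentioned in step 3 flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch on made-up numbers, showing the two usual treatments (dropping vs. capping):

```python
import pandas as pd

# Made-up sample where 95 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option A: drop rows that fall outside the fences.
filtered = s[(s >= lower) & (s <= upper)]

# Option B: cap (winsorize) values at the fences instead of dropping them.
capped = s.clip(lower=lower, upper=upper)
```

Dropping discards information, while capping keeps the row but limits the outlier's influence; which option is appropriate depends on the analysis.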
4. Data Transformation:
o Normalized numerical values using Min-Max scaling.
o Encoded categorical variables using one-hot encoding.
o Converted string values to lowercase for uniformity.
5. Validation:
o Verified the absence of null values and duplicates post-cleaning.
o Checked for logical consistency and completeness.
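The validation step can be automated as a short set of assertions over the cleaned data. The dataset here is a hypothetical result of the earlier steps; the checks themselves are generic:

```python
import pandas as pd

# Hypothetical cleaned dataset produced by the steps above.
clean = pd.DataFrame({
    "price_scaled": [0.0, 0.5, 1.0],
    "rooms": [2, 3, 4],
})

# Integrity checks: no nulls, no duplicates, scaled values in range.
assert clean.isna().sum().sum() == 0, "null values remain"
assert clean.duplicated().sum() == 0, "duplicate rows remain"
assert clean["price_scaled"].between(0, 1).all(), "scaled values out of range"
report = "validation passed"
```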