ETI Microproject

The document outlines a micro-project titled 'Data Preparation And Cleaning' conducted by students at Loknete Hon. Hanmantrao Patil Charitable Trust’s Adarsh Institute of Technology and Research Centre. The project aims to enhance data quality through techniques such as data cleaning and transformation, addressing common issues like missing values and duplicates. The report includes a detailed methodology, progress report, and acknowledgments, emphasizing the importance of clean data in analytics.

Uploaded by

Falak Mulla

Loknete Hon. Hanmantrao Patil Charitable Trust’s

ADARSH INSTITUTE OF TECHNOLOGY AND RESEARCH CENTRE, VITA
MSBTE-0991

SIXTH SEMESTER
(Year: 2024-25)
Micro Project

Big Data Analytics (22684)

Title of the Project: Data Preparation And Cleaning.


Branch: Artificial Intelligence & Machine Learning (AN6I)

Members of the Group:

Sr. No.   Name of Student           Roll No.
01        Gauri Santosh Ambi        3102
02        Akanksha Dhanaji Kadam    3104
03        Shreya Anil Yadav         3116


Loknete Hon. Hanmantrao Patil Charitable Trust’s
Adarsh Institute of Technology & Research Centre, Vita

CERTIFICATE
This is to certify that the micro project report entitled
“Data Preparation And Cleaning.”
Submitted by
Sr. No.   Name of Student           Roll No.
01        Gauri Santosh Ambi        3102
02        Akanksha Dhanaji Kadam    3104
03        Shreya Anil Yadav         3116

For Sixth Semester of Diploma in Artificial Intelligence & Machine Learning, for the course
Big Data Analytics (22684), for academic year 2024-25 as per MSBTE, Mumbai curriculum
of ‘I’ scheme.

DIPLOMA OF ENGINEERING
(Artificial Intelligence & Machine Learning)

SUBMITTED TO
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION MUMBAI
ACADEMIC YEAR 2024-25

Project Guide            H.O.D.                   Principal

Ms. S. S. Deshmukh       Prof. A. A. Vankudre     Dr. P. S. Patil
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION, MUMBAI

MICRO PROJECT
Progress Report / Weekly Report

Title of the Project: Data Preparation And Cleaning

Course: ETI (22684)    Program: Artificial Intelligence & Machine Learning (AN6I)

Week No.  Date  Duration       Work / Activity Performed          Sign of the Faculty
1               30 min         Understanding the basics
2               30 min         Deciding the aim
3               1 hour         Collecting the data
4               45 min         Preparing the project proposal
5               20 min         Literature review search
6               30 min         Analysis of data
7               20 min         Discussion on preparation
8               10 min         Correction in booklets
9               30 min         Report writing
10              20 min         Checking the report
11              10 min         Report corrections
12              15 min         Rechecking the report
13              10 min         Finalizing the report
14              45 min         Final submission
15              10 min         Oral presentation of micro project
TOTAL           6 hrs 25 min
Teacher Evaluation Sheet for Micro Project

Course Title and Code: ETI (22684)

Title of the Project: Data Preparation And Cleaning


COs addressed by the Micro Project:

CO a: Understand the importance and impact of data quality in data analysis.

CO b: Identify common data issues such as missing values, duplicates, and inconsistencies.

CO c: Identify common data issues such as missing values, duplicates, and inconsistencies.

CO d: Identify common data issues such as missing values, duplicates, and inconsistencies.

CO e: Demonstrate improved decision-making capabilities through the use of clean and structured data.

Marks:

Roll No.  Name of Student          Marks for Group Work (06)  Marks obtained by the individual based on viva (04)  Total Marks (10)
3104      Akanksha Dhanaji Kadam

Name and designation of Faculty Member: Ms. S. S. Deshmukh,
Lecturer (Department of Science and Humanities)

Signature:
ACKNOWLEDGEMENT

I express my sincere gratitude to Ms. S. S. Deshmukh, Department of Artificial Intelligence &
Machine Learning, for her stimulating guidance, continuous encouragement, and supervision
throughout the course of the present work.

I would like to place on record my deep sense of gratitude to Prof. A. A. Vankudre,
HOD, Department of Artificial Intelligence & Machine Learning, for his generous guidance, help,
and useful suggestions.

I am extremely thankful to Principal Dr. P. S. Patil for the motivation and for providing the
infrastructural facilities without which this work would not have been possible.

I would like to express my gratitude to all my colleagues for their support, co-operation,
fruitful discussions on diverse seminar topics, and technical help.

Name of Student                Sign

1. Gauri Santosh Ambi

2. Akanksha Dhanaji Kadam

3. Shreya Anil Yadav


Index

Sr. No. Content

1.0 Rationale

2.0 Course Outcomes Addressed

3.0 Literature Review

4.0 Actual Methodology Followed

5.0 Actual Resources Used

6.0 Outputs of the Micro Project

7.0 Skill Developed / learning out of this Micro Project

8.0 Applications of this Micro Project

9.0 Area of Future Improvement


PART A - Micro-Project Proposal

Title of Micro-Project: Data Preparation And Cleaning.

1.0 Brief Introduction

In the realm of data analysis and machine learning, data preparation and cleaning are foundational
steps that significantly influence the accuracy and reliability of results. Raw data collected from
various sources—such as surveys, sensors, logs, or online platforms—often contain inconsistencies,
missing values, duplicates, and errors that must be addressed before meaningful analysis can begin.
Data preparation involves transforming raw data into a structured and usable format. This includes
tasks like data integration, formatting, normalization, and transformation. On the other hand, data
cleaning focuses specifically on identifying and rectifying data quality issues, such as incorrect
entries, outliers, and missing or duplicated values.

2.0 Aim of the Micro-Project

The primary aim of this micro-project is to understand and apply the techniques of data preparation
and cleaning in order to enhance the quality of datasets used for analysis or modeling. By working
through real-world or simulated data, the project seeks to demonstrate how effective preprocessing
can improve data usability and accuracy, thus enabling better insights and decision-making.

3.0 Intended Course Outcomes

Upon successful completion of this micro-project, learners will be able to:

• Identify common data quality issues in raw datasets.
• Apply various data cleaning techniques to handle missing, inconsistent, or erroneous data.
• Perform data transformation and normalization to prepare data for analysis or modeling.
• Demonstrate proficiency in using tools and libraries (e.g., Python, Pandas, Excel) for data
  preprocessing.
• Understand the impact of data preparation on the accuracy and reliability of analytical outcomes.

4.0 Literature Review

A review of existing literature highlights the critical role of data preparation and cleaning in data
science workflows. According to Kandel et al. (2011), data analysts spend approximately 80% of
their time preparing data, underscoring its importance. Techniques such as imputation for missing
values, outlier detection, normalization, and data transformation are well-documented in both
academic and industry practices.

Several studies, including Rahm and Do (2000), emphasize the challenges in data integration and
transformation, especially when dealing with heterogeneous sources. Tools like OpenRefine, Python
(with Pandas and NumPy), and R provide comprehensive frameworks for automating and
streamlining the data cleaning process.
5.0 Proposed Methodology

The methodology for this micro-project includes the following steps:

1. Data Collection
   Acquire a dataset from a public source (e.g., Kaggle, UCI Machine Learning Repository) or a
   custom-generated set.
2. Data Inspection
   Explore the dataset to identify quality issues such as missing values, duplicates, outliers, and
   inconsistencies.
3. Data Cleaning
   o Handle missing values (e.g., removal, mean/median imputation).
   o Remove or correct duplicate records.
   o Detect and treat outliers.
   o Correct inconsistent formatting or erroneous entries.
4. Data Transformation
   o Normalize or scale numerical features.
   o Encode categorical variables if needed (e.g., one-hot encoding).
   o Format date/time values into consistent formats.
5. Data Validation
   Evaluate the cleaned dataset to ensure data integrity and suitability for further analysis or modeling.
6. Documentation and Reporting
   Record all steps taken, justifying the cleaning decisions and demonstrating before-and-after
   comparisons.
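
The inspection and basic cleaning steps above can be sketched in Python with Pandas. This is a minimal illustration under stated assumptions, not the project's actual script: the sample data, the column names (price, city), and the helper names inspect_quality and basic_clean are hypothetical, and removing duplicates before imputing is one possible ordering among several.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
raw = pd.DataFrame({
    "price": [250000.0, None, 310000.0, 310000.0, 180000.0],
    "city": ["Pune", "Mumbai", "Nagpur", "Nagpur", None],
})


def inspect_quality(df):
    """Summarize row count, missing values per column, and duplicate rows."""
    return {
        "rows": len(df),
        "missing": {col: int(n) for col, n in df.isna().sum().items()},
        "duplicates": int(df.duplicated().sum()),
    }


def basic_clean(df):
    """Drop exact duplicate rows, then impute the remaining gaps.

    Numeric columns are filled with the median and other columns with
    the mode (an assumed policy; removal or mean imputation also fit
    the methodology described above).
    """
    out = df.drop_duplicates().copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out.reset_index(drop=True)


report = inspect_quality(raw)
cleaned = basic_clean(raw)
print(report)
print(cleaned)
```

Running inspect_quality before and after basic_clean gives a simple check that the missing-value and duplicate counts have dropped to zero.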
6.0 Resources Required

Sr. No.  Name of Resource/Material  Specifications  Quantity  Remark
1        Computer System            i-5             1
2        Microsoft Word             2010            1
3        Internet                   Any             1
7.0 Action Plan

Sr. No.  Details of activity              Planned start date  Planned finish date  Name of Responsible Team Members
1        Project Proposal                                                          Gauri Santosh Ambi
2        Data Collection & Analysis                                                Akanksha Dhanaji Kadam
3        Preparation of Prototype/Model                                            Shreya Anil Yadav
4        Preparation of Report                                                     Akanksha Dhanaji Kadam
5        Presentation & Submission                                                 Shreya Anil Yadav
PART B - Micro-Project Report

Title of Micro-Project: Data Preparation And Cleaning

1.0 Rationale:
In today's data-driven world, the availability of raw data is abundant, but its usefulness is often
limited due to issues such as incompleteness, inconsistency, noise, and redundancy. Without proper
preparation and cleaning, raw data can lead to inaccurate analysis, misleading insights, and
unreliable predictive models. Hence, there is a growing need to emphasize the importance of data
preprocessing in any data-centric project.

2.0 Course Outcomes Addressed

• CO1: Demonstrate understanding of the role of data quality in analytics.
  By identifying and resolving issues such as missing values, outliers, and inconsistent data,
  students develop a deep appreciation for the impact of data quality on the accuracy and
  reliability of analytical outcomes.
• CO2: Apply appropriate data preprocessing techniques.
  The project enables students to practice cleaning techniques such as imputation, deduplication,
  and normalization on real-world datasets using tools like Python and Pandas.
• CO3: Utilize data wrangling tools and libraries effectively.
  Through hands-on experience with libraries such as Pandas, NumPy, and possibly OpenRefine
  or Excel, students learn to streamline the data preparation process and automate repetitive tasks.
• CO4: Prepare structured and clean datasets for analysis or modeling.
  Students transform messy, unstructured datasets into well-organized forms suitable for
  visualization, analysis, or feeding into machine learning models.
• CO5: Document the data cleaning and transformation process.
  The project encourages thorough documentation of each preprocessing step, which helps in
  understanding the reasoning behind cleaning decisions and ensures transparency in data
  handling.

3.0 Literature Review

The significance of data preparation has been emphasized across academic literature and industrial
practices. Kandel et al. (2011) report that up to 80% of a data scientist’s time is spent preparing and
cleaning data. This process includes handling missing data, detecting outliers, and transforming data
into usable formats.

Rahm and Do (2000) highlighted the challenges in data cleaning, particularly in integrating data
from multiple heterogeneous sources. Various tools and libraries have been developed to aid in this
process, including OpenRefine, Python (Pandas, NumPy), and R. Studies further confirm that clean
data significantly improves the performance of machine learning models and data-driven decision
systems.
4.0 Actual Methodology Followed:

The following methodology was implemented during the micro-project:

1. Dataset Selection:
   A real-world dataset was sourced from a public repository (e.g., Kaggle, “Housing Prices Dataset”).
2. Initial Data Inspection:
   o Checked for missing values, nulls, and NaNs.
   o Identified duplicate rows.
   o Explored data types and summary statistics.
3. Data Cleaning:
   o Replaced missing values using mean/median for numerical columns and mode for
     categorical columns.
   o Removed duplicate entries.
   o Handled outliers using the IQR (Interquartile Range) method.
   o Standardized inconsistent formats (e.g., date/time).
4. Data Transformation:
   o Normalized numerical values using Min-Max scaling.
   o Encoded categorical variables using one-hot encoding.
   o Converted string values to lowercase for uniformity.
5. Validation:
   o Verified the absence of null values and duplicates post-cleaning.
   o Checked for logical consistency and completeness.
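
The outlier, scaling, encoding, and validation steps listed above can be sketched as follows. This is an illustrative sketch rather than the project's own code: the sample columns (area, city) and the function names are hypothetical, and the report does not say whether outliers were removed or capped, so clipping values to the IQR fences with the common 1.5 multiplier is an assumption. Min-Max scaling maps each value x to (x - min) / (max - min).

```python
import pandas as pd


def clip_outliers_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed treatment)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def min_max_scale(series):
    """Rescale values to the range [0, 1]; a constant column maps to 0."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        return series * 0.0
    return (series - lo) / (hi - lo)


def transform(df, numeric_cols, categorical_cols):
    """Clip and scale numeric columns, lowercase and one-hot encode text."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = min_max_scale(clip_outliers_iqr(out[col]))
    for col in categorical_cols:
        out[col] = out[col].str.strip().str.lower()
    return pd.get_dummies(out, columns=categorical_cols)


def validate(df):
    """Count remaining nulls and duplicate rows; both should be zero."""
    return {
        "nulls": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }


# Hypothetical sample: one extreme area value and inconsistent city spellings.
sample = pd.DataFrame({
    "area": [50.0, 60.0, 70.0, 80.0, 1000.0],
    "city": ["Pune", " pune", "MUMBAI", "Mumbai", "Pune"],
})

result = transform(sample, ["area"], ["city"])
checks = validate(result)
print(result)
print(checks)
```

Lowercasing before encoding matters here: without it, "Pune" and " pune" would produce separate one-hot columns for the same city.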

5.0 Actual Resources Used

Sr. No.  Name of Resource/Material  Specifications  Quantity  Remark
1        Computer System            i-5             1
2        Microsoft Word             2010            1
3        Internet                   Any             1

6.0 Outputs of the Micro-Project

• A clean, transformed, and analysis-ready dataset.
• A detailed log/report of cleaning steps taken.
• Visualizations (before and after cleaning) showing the improvements in data quality.
• Python scripts or Jupyter Notebooks demonstrating the entire workflow.
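
A before-and-after comparison of the kind listed above could be produced with a small helper like the one below. This is a hedged sketch, not an output of the project: the sample data and the names quality_metrics and before_after_report are hypothetical, and the cleaning shown (dropping duplicates and incomplete rows) merely stands in for whatever cleaning was actually applied.

```python
import pandas as pd


def quality_metrics(df):
    """Return basic data-quality counts for one DataFrame."""
    return pd.Series({
        "rows": len(df),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    })


def before_after_report(before, after):
    """Place the metrics of the raw and cleaned data side by side."""
    return pd.DataFrame({
        "before": quality_metrics(before),
        "after": quality_metrics(after),
    })


# Hypothetical raw data with one missing value and one duplicate row.
before = pd.DataFrame({
    "price": [100.0, None, 100.0],
    "city": ["pune", "mumbai", "pune"],
})
# Stand-in cleaning step for illustration only.
after = before.drop_duplicates().dropna()

comparison = before_after_report(before, after)
print(comparison.to_string())
```

The resulting table can be pasted into the report as the cleaning log, or plotted as a bar chart for the before/after visualization.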
Conclusion
Data preparation and cleaning are critical steps in any data science pipeline. Through this
micro-project, we learned how to identify, address, and document common data quality issues, leading
to more accurate and reliable datasets. These preprocessing techniques ensure that downstream tasks
such as visualization, statistical analysis, or model building are based on trustworthy inputs. Overall,
this project reinforces the notion that clean data is not merely helpful but essential for effective
data-driven decision-making.
