ETI Microproject
SIXTH SEMESTER
(Year: 2024-25)
Micro Project
CERTIFICATE
This is to certify that the micro project report entitled
“Data Preparation And Cleaning.”
Submitted by
Sr. No. Name of Student Roll No.
For Sixth Semester of Diploma in Artificial Intelligence & Machine Learning, in the course
Big Data Analytics (22684), for the academic year 2024-25 as per the MSBTE, Mumbai curriculum
of the 'I' scheme.
DIPLOMA OF ENGINEERING
(Artificial Intelligence & Machine Learning)
SUBMITTED TO
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION MUMBAI
ACADEMIC YEAR 2024-25
MICRO PROJECT
Progress Report / Weekly Report
Week No | Date | Duration in Hrs. | Work / Activity Performed | Sign of the Faculty
1 | | 30 min | Knowing the basics |
CO c: Identify common data issues such as missing values, duplicates, and inconsistencies.
CO d: Identify common data issues such as missing values, duplicates, and inconsistencies.
CO e: Demonstrate improved decision-making capabilities through the use of clean and structured data.
Marks:
Roll No. | Name of Student | Marks by Group Work (04) | Marks for individual based on viva (06) | Total Marks (10)
3104 | Akanksha Dhanaji Kadam | | |
Signature:
ACKNOWLEDGEMENT
I am extremely thankful to Principal Dr. P. S. Patil for his motivation and for providing the
infrastructural facilities to work in, without which this work would not have been possible.
I would like to express my gratitude to all my colleagues for their support, co-operation and
fruitful discussions on diverse seminar topics and technical help.
1.0 Rationale
In the realm of data analysis and machine learning, data preparation and cleaning are foundational
steps that significantly influence the accuracy and reliability of results. Raw data collected from
various sources—such as surveys, sensors, logs, or online platforms—often contain inconsistencies,
missing values, duplicates, and errors that must be addressed before meaningful analysis can begin.
Data preparation involves transforming raw data into a structured and usable format. This includes
tasks like data integration, formatting, normalization, and transformation. On the other hand, data
cleaning focuses specifically on identifying and rectifying data quality issues, such as incorrect
entries, outliers, and missing or duplicated values.
The primary aim of this micro-project is to understand and apply the techniques of data preparation
and cleaning in order to enhance the quality of datasets used for analysis or modeling. By working
through real-world or simulated data, the project seeks to demonstrate how effective preprocessing
can improve data usability and accuracy, thus enabling better insights and decision-making.
A review of existing literature highlights the critical role of data preparation and cleaning in data
science workflows. According to Kandel et al. (2011), data analysts spend approximately 80% of
their time preparing data, underscoring its importance. Techniques such as imputation for missing
values, outlier detection, normalization, and data transformation are well-documented in both
academic and industry practices.
Several studies, including Rahm and Do (2000), emphasize the challenges in data integration and
transformation, especially when dealing with heterogeneous sources. Tools like OpenRefine, Python
(with Pandas and NumPy), and R provide comprehensive frameworks for automating and
streamlining the data cleaning process.
5.0 Proposed Methodology
1. Data Collection
Acquire a dataset from a public source (e.g., Kaggle, UCI Machine Learning Repository) or a
custom-generated set.
2. Data Inspection
Explore the dataset to identify quality issues such as missing values, duplicates, outliers, and
inconsistencies.
3. Data Cleaning
o Handle missing values (e.g., removal, mean/median imputation).
o Remove or correct duplicate records.
o Detect and treat outliers.
o Correct inconsistent formatting or erroneous entries.
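The cleaning steps above can be sketched in pandas, the library mentioned earlier. This is a minimal illustration on a small hypothetical dataset (the column names and values are invented for demonstration, not taken from the project's actual data):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with common quality issues:
# a missing value, inconsistent formatting, and a duplicate row.
df = pd.DataFrame({
    "age":  [25, np.nan, 30, 25, 120],
    "city": ["Pune", "Mumbai", "pune ", "Pune", "Mumbai"],
})

# 1. Handle missing values: impute the numerical column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Correct inconsistent formatting before deduplication,
#    so that "pune " and "Pune" are recognized as the same value.
df["city"] = df["city"].str.strip().str.title()

# 3. Remove exact duplicate records.
df = df.drop_duplicates().reset_index(drop=True)
```

Note that formatting is normalized before deduplication; otherwise rows that differ only in whitespace or letter case would not be detected as duplicates.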
4. Data Transformation
o Normalize or scale numerical features.
o Encode categorical variables if needed (e.g., one-hot encoding).
o Format date/time values into consistent formats.
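The transformation steps can likewise be sketched with pandas alone. The dataset and column names here are hypothetical, and Min-Max scaling is written out by hand rather than relying on any particular preprocessing library:

```python
import pandas as pd

# Hypothetical dataset with one numerical, one categorical,
# and one date/time column.
df = pd.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "type":  ["flat", "villa", "flat"],
    "sold":  ["2024-01-05", "2024-02-05", "2024-03-10"],
})

# Normalize the numerical feature to [0, 1] with Min-Max scaling.
p = df["price"]
df["price_scaled"] = (p - p.min()) / (p.max() - p.min())

# One-hot encode the categorical column (creates type_flat, type_villa).
df = pd.get_dummies(df, columns=["type"])

# Parse date strings into a consistent datetime type.
df["sold"] = pd.to_datetime(df["sold"])
```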
5. Data Validation
Evaluate the cleaned dataset to ensure data integrity and suitability for further analysis or modeling.
6. Documentation and Reporting
Record all steps taken, justifying the cleaning decisions and demonstrating before-and-after
comparisons.
6.0 Resources Required
Sr. No. | Name of Resource | Specification | Quantity
3 | Internet | Any | 1
7.0 Action Plan
Sr. No. | Details of Activity | Name of Responsible Team Member
1 | Project Proposal | Gauri Santosh Ambi
2 | Data Collection & Analysis | Akanksha Dhanaji Kadam
3 | Preparation of Prototype / Model | Shreya Anil Yadav
4 | Preparation of Report | Akanksha Dhanaji Kadam
5 | Presentation & Submission | Shreya Anil Yadav
PART B - Micro-Project Proposal
1.0 Rationale:
In today's data-driven world, the availability of raw data is abundant, but its usefulness is often
limited due to issues such as incompleteness, inconsistency, noise, and redundancy. Without proper
preparation and cleaning, raw data can lead to inaccurate analysis, misleading insights, and
unreliable predictive models. Hence, there is a growing need to emphasize the importance of data
preprocessing in any data-centric project.
The significance of data preparation has been emphasized across academic literature and industrial
practices. Kandel et al. (2011) report that up to 80% of a data scientist’s time is spent preparing and
cleaning data. This process includes handling missing data, detecting outliers, and transforming data
into usable formats.
Rahm and Do (2000) highlighted the challenges in data cleaning, particularly in integrating data
from multiple heterogeneous sources. Various tools and libraries have been developed to aid in this
process, including OpenRefine, Python (Pandas, NumPy), and R. Studies further confirm that clean
data significantly improves the performance of machine learning models and data-driven decision
systems.
4.0 Actual Methodology Followed:
1. Dataset Selection:
A real-world dataset was sourced from [e.g., Kaggle – “Housing Prices Dataset”].
2. Initial Data Inspection:
o Checked for missing values, nulls, and NaNs.
o Identified duplicate rows.
o Explored data types and summary statistics.
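The inspection checks listed above map directly onto a few pandas calls. The small dataset below is a hypothetical stand-in for the housing data, included only so the calls have something to run against:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the housing dataset.
df = pd.DataFrame({
    "price": [150000, 200000, np.nan, 200000],
    "rooms": [3, 4, 2, 4],
})

missing_per_column = df.isna().sum()         # NaN/null count per column
duplicate_rows = int(df.duplicated().sum())  # count of exact duplicate rows
summary = df.describe()                      # summary statistics
dtypes = df.dtypes                           # data type of each column
```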
3. Data Cleaning:
o Replaced missing values using mean/median for numerical columns and mode for
categorical columns.
o Removed duplicate entries.
o Handled outliers using the IQR (Interquartile Range) method.
o Standardized inconsistent formats (e.g., date/time).
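The IQR method mentioned in step 3 flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch on made-up numbers, showing the two usual treatments (dropping vs. capping):

```python
import pandas as pd

# Made-up sample where 95 is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option A: drop rows that fall outside the fences.
filtered = s[(s >= lower) & (s <= upper)]

# Option B: cap (winsorize) values at the fences instead of dropping them.
capped = s.clip(lower=lower, upper=upper)
```

Dropping discards information, while capping keeps the row but limits the outlier's influence; which option is appropriate depends on the analysis.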
4. Data Transformation:
o Normalized numerical values using Min-Max scaling.
o Encoded categorical variables using one-hot encoding.
o Converted string values to lowercase for uniformity.
5. Validation:
o Verified the absence of null values and duplicates post-cleaning.
o Checked for logical consistency and completeness.
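The validation step can be automated as a short set of assertions over the cleaned data. The dataset here is a hypothetical result of the earlier steps; the checks themselves are generic:

```python
import pandas as pd

# Hypothetical cleaned dataset produced by the steps above.
clean = pd.DataFrame({
    "price_scaled": [0.0, 0.5, 1.0],
    "rooms": [2, 3, 4],
})

# Integrity checks: no nulls, no duplicates, scaled values in range.
assert clean.isna().sum().sum() == 0, "null values remain"
assert clean.duplicated().sum() == 0, "duplicate rows remain"
assert clean["price_scaled"].between(0, 1).all(), "scaled values out of range"
report = "validation passed"
```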