Internship Report
On
"AI Data Quality Analyst"
submitted by
VAMSHITHA
1ME21CS110
Internship carried out at
M S ENGINEERING COLLEGE
NAAC Accredited, Affiliated to VTU, Belagavi, Approved by AICTE, New Delhi
CERTIFICATE
Certified that the Internship work on the topic "AI Data Quality Analyst" has been
successfully carried out at M S Engineering College by VAMSHITHA, bearing USN
1ME21CS110, in partial fulfillment of the requirements for the IV Semester, degree of
Bachelor of Engineering in Computer Science and Engineering of Visvesvaraya
Technological University, Belagavi, during the academic year 2024-2025. It is certified that
all corrections/suggestions indicated for Internal Assessment have been incorporated in
the report deposited in the departmental library. The Internship report has been approved
as it satisfies the academic requirements in respect of Internship work prescribed for the
said degree.
-------------------------------- ---------------------------------------
Dr. MALATESH S H Dr. RANAPRATHAP REDDY
Date:
VAMSHITHA

ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task would
be incomplete without the mention of the people who made it possible, whose constant
guidance and encouragement crowned the efforts with success.
I would like to express my thanks to the Principal, Dr. RANAPRATAP REDDY, whose
encouragement motivated me toward the successful completion of the Internship work.
Last but not least, I thank my parents, who have been a constant source of inspiration
and instrumental in the successful completion of this Internship work.
VAMSHITHA
1ME21CS110
ABSTRACT
This internship at NSDC (National Skill Development Corporation) in collaboration with Rooman
Technologies focused on the critical role of data quality in AI and machine learning systems. As an AI
Data Quality Analyst intern, I was involved in the collection, preprocessing, validation, and analysis of
datasets used in training and evaluating AI models.
The primary objective was to ensure that the data used for AI development was accurate, consistent, and
relevant, thereby improving the performance and reliability of AI solutions. Tasks included data cleaning,
annotation verification, error detection, and maintaining documentation of data pipelines. I also
contributed to creating guidelines for data labelling and quality assurance protocols.
Through this internship, I gained hands-on experience with tools and techniques used in AI data
management, including Python-based libraries (such as Pandas and NumPy), data visualization tools, and
version control systems. It provided valuable exposure to real-world AI projects and reinforced the
importance of high-quality data in driving effective and ethical AI outcomes.
SL No. CONTENTS
Chapter 3 INTRODUCTION
Chapter 4 OBJECTIVES
4.1 Problem Statement
Chapter 5 METHODOLOGY
Chapter 7 RESULTS
CONCLUSION
REFERENCES
CHAPTER 1
COMPANY PROFILE
The National Skill Development Corporation's (NSDC's) primary objective is to bridge the
skill gap in various sectors by funding and supporting training providers and facilitating
the development of scalable and sustainable
skill training initiatives. NSDC works closely with industry leaders, academic institutions,
and training partners to upskill India's workforce and enhance employability.
Key Focus Areas:
Funding and enabling skill training institutions
Promoting industry-relevant skill development
Supporting innovation in vocational education
Creating a skilled workforce aligned with market needs.
An internship in the domain of AI Data Quality Analyst focuses on ensuring the accuracy,
consistency, and reliability of data used in AI systems. Here are some key aspects of this role:
1. Data Validation and Cleaning: Interns may work on identifying and correcting errors in datasets
to ensure high-quality inputs for AI models.
2. Quality Assurance: Monitoring and testing data pipelines to ensure they meet predefined standards.
3. Collaboration: Working with data scientists and engineers to understand data requirements and
improve data quality.
4. Tools and Techniques: Learning and using tools like Python, SQL, and data visualization software.
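The validation and cleaning tasks listed above can be illustrated with a short Pandas sketch. The dataset here is invented for illustration only (it is not one of the actual project datasets): it contains a missing value, a duplicated id, and inconsistent label casing, which the script detects and then repairs.

```python
import pandas as pd
import numpy as np

# Illustrative raw data with the kinds of issues described above:
# a missing score, a duplicated id, and inconsistent label casing.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "label": ["cat", "dog", "dog", "Cat"],
    "score": [0.9, np.nan, 0.7, 0.8],
})

# Detect issues before cleaning.
missing = int(df["score"].isna().sum())        # number of missing scores
dupes = int(df.duplicated(subset="id").sum())  # number of duplicated ids

# Clean: normalize label casing, drop duplicate ids, fill missing scores
# with the column mean.
df["label"] = df["label"].str.lower()
df = df.drop_duplicates(subset="id", keep="first")
df["score"] = df["score"].fillna(df["score"].mean())

print(missing, dupes, len(df))
```

Filling missing values with the column mean is only one possible strategy; in practice the choice depends on the dataset and the downstream model.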
The primary aim of this internship was to gain practical experience in data preprocessing,
validation, and quality assurance within AI projects. Working alongside industry
professionals, I was able to understand the end-to-end data lifecycle and the challenges
faced in maintaining data quality in real-world applications. This report outlines the key
responsibilities, tools used, and insights gained throughout my internship journey.
In the current AI development landscape, data serves as the foundation upon which
machine learning and artificial intelligence models are built. However, many existing
systems face significant challenges related to data quality, which directly impacts the
performance and reliability of AI solutions. The existing system for managing AI data
typically involves the following components:
Data Collection: Data is gathered from various sources such as sensors, APIs, web scraping,
user inputs, and existing databases. In many cases, this raw data is unstructured,
inconsistent, or incomplete.
Data Storage: Collected data is stored in data lakes, cloud platforms, or local servers. While
storage solutions are scalable, they often lack integrated quality-control mechanisms.
Data Preprocessing: Data is cleaned, normalized, and formatted for model training.
However, this step is often semi-automated and lacks standardization, which can result in
poor-quality input data.
Quality Checks and Validation: Quality assurance is either minimal or rule-based, lacking
the intelligent systems needed to detect deeper issues such as bias, duplication, or semantic
errors. This leads to challenges in maintaining reliable and ethical AI models.
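As a concrete example of the normalization step in preprocessing, the following NumPy sketch applies min-max scaling to a toy feature column (the values are illustrative, not project data), mapping it to the range [0, 1] so features on different scales become comparable.

```python
import numpy as np

# Toy feature column on an arbitrary scale (illustrative values).
raw = np.array([10.0, 50.0, 30.0, 90.0])

# Min-max normalization to [0, 1], a common preprocessing step:
# x' = (x - min) / (max - min)
normalized = (raw - raw.min()) / (raw.max() - raw.min())

print(normalized)
```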
OBJECTIVES

To identify and analyze common data quality issues (e.g., missing values, mislabeled data,
duplicates) in datasets used for AI model training.
To implement data cleaning and preprocessing techniques using tools like Python, Pandas,
and Excel to improve dataset reliability.
To explore automation techniques for data validation and quality checks using AI or
scripting methods.
To document the entire data workflow and propose recommendations for future data quality
management improvements.
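The objective of automating validation through scripting can be sketched as a small rule-based checker. The `validate` helper and the rules below are illustrative only (not a library API or the internal tooling used during the internship): each rule is a named function that must return True for the dataset to pass.

```python
import pandas as pd

def validate(df, rules):
    """Run simple rule-based checks; return the names of failed rules.

    `rules` maps a rule name to a function that takes the DataFrame
    and returns True when the check passes. (Illustrative helper.)
    """
    return [name for name, check in rules.items() if not check(df)]

# Toy dataset with a missing email and an impossible age.
df = pd.DataFrame({"age": [25, 31, -4],
                   "email": ["a@x.com", "b@x.com", None]})

rules = {
    "no_missing_email": lambda d: d["email"].notna().all(),
    "age_non_negative": lambda d: (d["age"] >= 0).all(),
    "age_plausible":    lambda d: (d["age"] < 130).all(),
}

failed = validate(df, rules)
print(failed)
```

A checker like this can run automatically whenever a new dataset arrives, flagging problems before the data reaches model training.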
Data Collection (limited involvement): Reviewing the sources and formats of incoming
datasets.
Data Preprocessing & Cleaning: Detecting and resolving common issues like missing,
incorrect, or inconsistent entries.
Quality Metrics & Reporting: Measuring dataset quality using defined metrics such as
accuracy, completeness, and consistency.
Tools Used: Python (Pandas, NumPy), MS Excel, Jupyter Notebook, and internal
annotation tools provided by Rooman Technologies.
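Two of the quality metrics mentioned above, completeness and consistency, can be computed directly with Pandas. The dataset and the exact metric definitions below are illustrative (completeness as the fraction of non-missing cells; consistency as the fraction of values in a column that match a canonical form):

```python
import pandas as pd

# Toy dataset with one missing name, one missing score,
# and one inconsistently cased city name.
df = pd.DataFrame({
    "name":  ["Asha", "Ravi", None, "Meena"],
    "city":  ["Bengaluru", "bengaluru", "Delhi", "Delhi"],
    "score": [88, 92, None, 75],
})

# Completeness: fraction of non-missing cells across the table.
completeness = df.notna().sum().sum() / df.size

# Consistency (for 'city'): fraction of values already in title case.
consistency = (df["city"] == df["city"].str.title()).mean()

print(round(completeness, 2), round(consistency, 2))
```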
Problem Statement
In the rapidly evolving field of Artificial Intelligence (AI), the performance and reliability of
machine learning models are directly influenced by the quality of the data they are trained on.
Despite advancements in AI model architectures, many organizations continue to face
significant challenges related to the collection, preprocessing, labeling, and validation of data.
Inconsistent, incomplete, inaccurate, or biased datasets can lead to poor model performance,
ethical concerns, and flawed decision-making. At present, many existing systems rely on
manual or semi-automated processes for data quality management, which are often time-
consuming, error-prone, and difficult to scale. Additionally, the lack of standardized data
quality protocols, real-time monitoring tools, and feedback mechanisms further exacerbates the
issue.
Therefore, there is a pressing need for a comprehensive and intelligent data quality
management system that ensures accuracy, consistency, and usability of datasets throughout
the AI development lifecycle. The problem addressed in this internship was to identify these
challenges and contribute to the design and implementation of more robust data quality practices
within the AI development pipeline.
Artificial Intelligence models rely heavily on large volumes of high-quality data for training
and prediction. However, during my internship at NSDC and Rooman Technologies, it was
observed that the data used in various AI projects often contained several issues such as missing
values, incorrect labels, duplication, and inconsistency in formats. These data quality issues not
only delayed the model development process but also led to inaccurate outcomes during testing.
METHODOLOGY
The methodology followed during the internship was structured into several phases,
focusing on identifying, cleaning, validating, and improving the quality of datasets used in
AI projects. Each phase was executed using a combination of manual review, scripting, and
automation tools, ensuring a systematic and efficient approach to data quality management.
Key tasks across these phases included:
o Removing duplicate and redundant records.
o Checking datasets for missing values, incorrect labels, and inconsistent formats.
o Quality Assessment: measuring datasets against defined quality metrics and
collaborating with model development teams to understand the impact of data issues.
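The cleaning phase of this methodology can be sketched as a small, reusable Pandas routine. The helper name and the sample data are illustrative (this is not the internal tooling used at Rooman Technologies): it removes exact duplicate rows, strips stray whitespace from text columns, and reports how many rows were dropped.

```python
import pandas as pd

def clean_dataset(df):
    """Remove exact duplicate rows and strip whitespace from text
    columns; return the cleaned frame and the number of rows dropped.
    (Illustrative sketch of one cleaning pass.)"""
    before = len(df)
    df = df.drop_duplicates()
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    return df, before - len(df)

# Toy input: one exact duplicate row and a label with stray whitespace.
raw = pd.DataFrame({"label": [" cat", "dog", "dog"],
                    "value": [1, 2, 2]})
cleaned, dropped = clean_dataset(raw)
print(dropped, cleaned["label"].tolist())
```

Reporting the number of dropped rows alongside the cleaned data makes each cleaning pass auditable, which supports the documentation phase of the workflow.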
SYSTEM SPECIFICATION
An AI Data Quality Analyst plays a crucial role in ensuring clean, structured, and
accurate data for AI models. Here are some key system specifications and skills required for
this role:
Essential Skills
ETL Tools: Extract, Transform, Load tools like Informatica and Talend.
Technologies
1. Machine Learning (ML): Algorithms for detecting patterns, anomalies, and automating data
error rectification.
2. Natural Language Processing (NLP): Processes unstructured data like text documents and
social media posts.
3. Data Observability Tools: Platforms like Monte Carlo for monitoring data pipelines.
4. Cloud Platforms: AWS, Google Cloud, and Microsoft Azure for scalable data storage and
processing.
5. Data Validation Tools: Tools like Great Expectations and Apache Griffin for ensuring data
accuracy.
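As a lightweight stand-in for the ML-based anomaly detectors mentioned above, a simple statistical check already catches gross outliers: flag values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The data and the threshold of 2 are illustrative (chosen because the sample is tiny; 3 is more common on real data):

```python
import numpy as np

# Toy sensor readings: five plausible values and one gross outlier.
values = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 95.0])

# z-score: distance from the mean in units of standard deviation.
z = np.abs(values - values.mean()) / values.std()

# Flag values more than 2 standard deviations from the mean.
anomalies = values[z > 2]
print(anomalies)
```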
RESULTS
Through hands-on sessions, real-world projects, and mentorship by experts, participants have gained
practical expertise in ensuring data integrity, handling large datasets, and applying analytical tools and
techniques. The training emphasized not only technical proficiency but also the importance of ethical
data handling and decision-making.
By bridging the gap between theoretical knowledge and practical application, this program contributes
to the national vision of creating a future-ready workforce skilled in emerging technologies. The
collaboration with Rooman Technologies and NSDC ensures that learners are not only employable
but also capable of driving innovation in the AI and data sectors.
REFERENCES
https://www.montecarlodata.com/
https://openai.com/chatgpt/overview
https://copilot.microsoft.com/