Intro Intern Final Merged

Uploaded by

yvamshitha

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi-590018

Internship Report

On

“AI Data Quality Analyst”


submitted in partial fulfillment of the requirements for the IV-year degree of

Bachelor of Engineering in Computer Science and Engineering

submitted by

VAMSHITHA
1ME21CS110
Internship carried out at

M S ENGINEERING COLLEGE, BENGALURU

Under the Guidance of


Dr. MALATESH S H
HOD, CSE DEPT
M.S. Engineering College

M S ENGINEERING COLLEGE
NAAC Accredited, affiliated to VTU, Belagavi, Approved by AICTE New Delhi,

Navarathna Agrahara, off Intl. Airport Road, Bengaluru– 562110


2024-2025
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

Certified that the Internship work on the topic “AI Data Quality Analyst” has been
successfully carried out at M S Engineering College by VAMSHITHA, bearing USN
1ME21CS110, in partial fulfillment of the requirements for the IV Semester of the degree of
Bachelor of Engineering in Computer Science and Engineering of Visvesvaraya
Technological University, Belagavi during the academic year 2024-2025. It is certified that
all corrections/suggestions indicated for Internal Assessment have been incorporated in
the report deposited in the departmental library. The Internship report has been approved
as it satisfies the academic requirements in respect of Internship work prescribed for the
said degree.

-------------------------------- ---------------------------------------
Dr. MALATESH S H Dr. RANAPRATHAP REDDY

Internship Coordinator Principal


DECLARATION

I, VAMSHITHA [USN: 1ME21CS110], student of IV Year BE in Computer Science
and Engineering, M.S Engineering College, hereby declare that the Internship work
entitled “AI Data Quality Analyst” has been carried out by me and submitted in partial
fulfillment of the requirements for the IV Year degree of Bachelor of Engineering in
Computer Science and Engineering of Visvesvaraya Technological University,
Belagavi during the academic year 2024-2025.

Date: VAMSHITHA

Place: Bengaluru 1ME21CS110


ACKNOWLEDGMENT

The satisfaction and euphoria that accompany the successful completion of any task would
be incomplete without the mention of the people who made it possible, whose constant
guidance and encouragement crowned the efforts with success.

I would like to profoundly thank the Management of M S Engineering College for
providing such a healthy environment for the successful completion of the Internship work.

I would like to express my thanks to the Principal, Dr. RANAPRATAP REDDY, for
their encouragement, which motivated me toward the successful completion of the Internship
work.

It gives me immense pleasure to thank Dr. Malatesh S H, Professor and Head of
Department, for his constant support and encouragement.

Last, but not the least, I would hereby acknowledge and thank my parents who have been
a source of inspiration and also instrumental in the successful completion of the Internship
work.

VAMSHITHA
1ME21CS110
ABSTRACT

This internship at NSDC (National Skill Development Corporation) in collaboration with Rooman
Technologies focused on the critical role of data quality in AI and machine learning systems. As an AI
Data Quality Analyst intern, I was involved in the collection, preprocessing, validation, and analysis of
datasets used in training and evaluating AI models.
The primary objective was to ensure that the data used for AI development was accurate, consistent, and
relevant, thereby improving the performance and reliability of AI solutions. Tasks included data cleaning,
annotation verification, error detection, and maintaining documentation of data pipelines. I also
contributed to creating guidelines for data labelling and quality assurance protocols.
Through this internship, I gained hands-on experience with tools and techniques used in AI data
management, including Python-based libraries (such as Pandas and NumPy), data visualization tools, and
version control systems. It provided valuable exposure to real-world AI projects and reinforced the
importance of high-quality data in driving effective and ethical AI outcomes.
CONTENTS

SL No.      Contents                    Page No.
Chapter 1   COMPANY PROFILE             01
Chapter 2   INTERNSHIP DOMAIN           03
Chapter 3   INTRODUCTION                04
Chapter 4   OBJECTIVES                  05
            4.1 Problem Statement
Chapter 5   METHODOLOGY                 07
Chapter 6   SYSTEM SPECIFICATION        09
Chapter 7   RESULTS                     11
            CONCLUSION                  13
            REFERENCES                  14

CHAPTER 1
COMPANY PROFILE

National Skill Development Corporation (NSDC)

The National Skill Development Corporation (NSDC) is a public-private partnership
organization under the Ministry of Skill Development and Entrepreneurship (MSDE),
Government of India. Established in 2008, NSDC aims to promote skill development by
catalyzing the creation of large, high-quality vocational training institutions across the
country.

Its primary objective is to bridge the skill gap in various sectors by funding and
supporting training providers and facilitating the development of scalable and sustainable
skill training initiatives. NSDC works closely with industry leaders, academic institutions,
and training partners to upskill India's workforce and enhance employability.
Key Focus Areas:
• Funding and enabling skill training institutions
• Promoting industry-relevant skill development
• Supporting innovation in vocational education
• Creating a skilled workforce aligned with market needs

Rooman Technologies Pvt. Ltd.

Rooman Technologies is a leading IT training and workforce development company in
India, known for delivering industry-focused technical education. Since its inception in
1999, Rooman has been involved in providing training in areas such as networking,
cybersecurity, software development, and data science.

As an official training partner of NSDC, Rooman Technologies plays a significant
role in implementing skill development projects, especially in emerging fields like Artificial
Intelligence, Cloud Computing, and Data Analytics. The company combines hands-on
learning with real-world project exposure to prepare students and professionals for
employment and entrepreneurship.
Dept of CSE, MSEC, Bengaluru Page 1
Ai Data Quality Analyst 2024-2025

Together, NSDC and Rooman Technologies are contributing to a national mission of
empowering youth with in-demand technical skills, with a focus on employability and
digital transformation.
Core Services:
• Skill development and training programs
• Corporate IT training and certifications
• Industry-integrated internship and placement support
• Projects aligned with NSDC and government initiatives like Skill India



CHAPTER 2
INTERNSHIP DOMAIN

An internship as an AI Data Quality Analyst focuses on ensuring the accuracy,
consistency, and reliability of data used in AI systems. The key aspects of this role are:

1. Data Validation and Cleaning: Interns may work on identifying and correcting errors in datasets
to ensure high-quality inputs for AI models.

2. Data Annotation: Assisting in labeling data for supervised learning tasks.

3. Quality Assurance: Monitoring and testing data pipelines to ensure they meet predefined standards.

4. Collaboration: Working with data scientists and engineers to understand data requirements and
improve data quality.

5. Tools and Techniques: Learning and using tools like Python, SQL, and data visualization software.



CHAPTER 3
INTRODUCTION

The primary aim of this internship was to gain practical experience in data preprocessing,
validation, and quality assurance within AI projects. Working alongside industry
professionals, I was able to understand the end-to-end data lifecycle and the challenges
faced in maintaining data quality in real-world applications. This report outlines the key
responsibilities, tools used, and insights gained throughout my internship journey.

In the current AI development landscape, data serves as the foundation upon which
machine learning and artificial intelligence models are built. However, many existing
systems face significant challenges related to data quality, which directly impacts the
performance and reliability of AI solutions. The existing system for managing AI data
typically involves the following components:

Data Collection: Data is gathered from various sources such as sensors, APIs, web scraping,
user inputs, and existing databases. In many cases, this raw data is unstructured,
inconsistent, or incomplete.

Data Storage: Collected data is stored in data lakes, cloud platforms, or local servers. While
storage solutions are scalable, they often lack integrated quality control mechanisms.

Data Labelling and Annotation: Labelling is often done manually or through
crowdsourcing platforms. This process is time-consuming and prone to human error,
leading to inaccuracies and inconsistencies in the labelled datasets.

Data Preprocessing: Data is cleaned, normalized, and formatted for model training.
However, this step is often semi-automated and lacks standardization, which can result in
poor-quality input data.

Quality Checks and Validation: Quality assurance is either minimal or rule-based, lacking
the intelligent systems needed to detect deeper issues such as bias, duplication, or semantic
errors. This leads to challenges in maintaining reliable and ethical AI models.
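
To make the "deeper issues" named above more concrete, the following is a minimal sketch in plain Python of two simple checks: class imbalance as one crude proxy for bias, and duplicates that differ only in case or punctuation. The function names and the sample data are hypothetical and are not taken from the internship projects. Real bias and semantic-error detection would need considerably more than this.

```python
import re
from collections import Counter, defaultdict


def label_shares(labels):
    """Return each label's share of the dataset (values sum to 1.0)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Largest class count divided by smallest; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


def near_duplicate_groups(texts):
    """Group row indices whose text matches after case/punctuation normalization."""
    groups = defaultdict(list)
    for index, text in enumerate(texts):
        # Lowercase, then collapse every run of non-alphanumerics to one space.
        key = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Hypothetical sample data, for illustration only.
labels = ["ham", "ham", "ham", "ham", "spam"]
texts = ["Win a prize!", "win a PRIZE", "Meeting at 10", "Lunch?"]

shares = label_shares(labels)
ratio = imbalance_ratio(labels)
duplicates = near_duplicate_groups(texts)
print(shares, ratio, duplicates)
```

A high imbalance ratio does not prove that a dataset is biased, but it is a cheap signal that a class may be under-represented and worth a manual review.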



CHAPTER 4
OBJECTIVES

To identify and analyze common data quality issues (e.g., missing values, mislabeled data,
duplicates) in datasets used for AI model training.

To implement data cleaning and preprocessing techniques using tools like Python, Pandas,
and Excel to improve dataset reliability.

To assist in the development of standardized data labeling and annotation guidelines to
minimize human error and ensure consistency.

To explore automation techniques for data validation and quality checks using AI or
scripting methods.

To evaluate the impact of improved data quality on AI model performance through
comparison and feedback.

To document the entire data workflow and propose recommendations for future data quality
management improvements.

Data Collection (limited involvement): Reviewing the sources and formats of incoming
datasets.

Data Preprocessing & Cleaning: Detecting and resolving common issues like missing,
incorrect, or inconsistent entries.

Data Annotation Review: Validating labels and identifying labeling errors.

Quality Metrics & Reporting: Measuring dataset quality using defined metrics such as
accuracy, completeness, and consistency.

Tools Used: Python (Pandas, NumPy), MS Excel, Jupyter Notebook, and internal
annotation tools provided by Rooman Technologies.


Problem Statement

In the rapidly evolving field of Artificial Intelligence (AI), the performance and reliability of
machine learning models are directly influenced by the quality of the data they are trained on.
Despite advancements in AI model architectures, many organizations continue to face
significant challenges related to the collection, preprocessing, labeling, and validation of data.
Inconsistent, incomplete, inaccurate, or biased datasets can lead to poor model performance,
ethical concerns, and flawed decision-making. At present, many existing systems rely on
manual or semi-automated processes for data quality management, which are often time-
consuming, error-prone, and difficult to scale. Additionally, the lack of standardized data
quality protocols, real-time monitoring tools, and feedback mechanisms further exacerbates the
issue.

Therefore, there is a pressing need for a comprehensive and intelligent data quality
management system that ensures accuracy, consistency, and usability of datasets throughout
the AI development lifecycle. The problem addressed in this internship was to identify these
challenges and contribute to the design and implementation of more robust data quality practices
within the AI development pipeline.

Artificial Intelligence models rely heavily on large volumes of high-quality data for training
and prediction. However, during my internship at NSDC and Rooman Technologies, I
observed that the data used in various AI projects often contained issues such as missing
values, incorrect labels, duplication, and inconsistent formats. These data quality issues not
only delayed the model development process but also led to inaccurate outcomes during testing.



CHAPTER 5

METHODOLOGY

The methodology followed during the internship was structured into several phases,
focusing on identifying, cleaning, validating, and improving the quality of datasets used in
AI projects. Each phase was executed using a combination of manual review, scripting, and
automation tools, ensuring a systematic and efficient approach to data quality management.

1. Data Collection Review

• Examined existing datasets used in ongoing AI projects.
• Identified data sources (CSV files, databases, web scraping outputs).
• Verified initial format and structure to assess readiness for preprocessing.

2. Data Preprocessing and Cleaning

• Used Python (Pandas, NumPy) to clean and transform datasets.
• Tasks included:
  o Handling missing values (e.g., imputation or removal).
  o Removing duplicates.
  o Standardizing date formats and categorical values.
  o Detecting and resolving outliers using statistical methods.
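
The cleaning tasks listed above can be sketched with Pandas roughly as follows. This is an illustrative example only: the column names, the sample values, the two accepted date formats, the choice of median imputation, and the 1.5 x IQR outlier rule are all assumptions made for the sketch, not the actual scripts or settings used during the internship.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
raw = pd.DataFrame({
    "record_id": [1, 2, 2, 3, 4, 5, 6],
    "signup_date": ["2024-01-05", "2024-01-06", "2024-01-06", "05/02/2024",
                    None, "2024-03-10", "2024-03-11"],
    "category": ["Cat", "dog ", "dog ", "DOG", "cat", None, "cat"],
    "score": [10.0, 12.0, 12.0, np.nan, 11.0, 13.0, 500.0],
})


def parse_dates(series, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each assumed format in turn; values matching none become NaT."""
    parsed = None
    for fmt in formats:
        attempt = pd.to_datetime(series, format=fmt, errors="coerce")
        parsed = attempt if parsed is None else parsed.fillna(attempt)
    return parsed


def clean(df):
    df = df.copy()

    # 1. Remove exact duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)

    # 2. Standardize categorical values (trim, lowercase, fill missing).
    df["category"] = df["category"].str.strip().str.lower().fillna("unknown")

    # 3. Standardize dates into a single datetime column.
    df["signup_date"] = parse_dates(df["signup_date"])

    # 4. Impute missing numeric values with the column median.
    df["score"] = df["score"].fillna(df["score"].median())

    # 5. Flag (rather than drop) outliers using the 1.5 * IQR rule.
    q1, q3 = df["score"].quantile(0.25), df["score"].quantile(0.75)
    iqr = q3 - q1
    df["is_outlier"] = ~df["score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df


cleaned = clean(raw)
print(cleaned)
```

Flagging outliers in a separate column, instead of deleting them, keeps the decision reviewable: an extreme value may be a data entry error or a genuine rare case.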

3. Data Annotation Validation

• Reviewed labeled data used for supervised machine learning tasks.
• Checked for:
  o Incorrect or inconsistent labeling.
  o Ambiguities in class definitions.
  o Mismatches between inputs and labels.
• Used Excel and internal labeling platforms for manual corrections.
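
Part of the label review described above can be supported by a small script. The sketch below, in plain Python, flags labels that fall outside an agreed label set and inputs that appear more than once with different labels. The label set, the record layout of (text, label) pairs, and the sample records are hypothetical. The internal labeling platforms mentioned above are not modeled here.

```python
from collections import defaultdict

# Hypothetical label set; a real project would take this from its guidelines.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def find_label_issues(records, allowed=ALLOWED_LABELS):
    """Check (text, label) pairs for invalid labels and conflicting labels.

    Returns a tuple (invalid, conflicts):
      invalid   - list of (row_index, original_label) not in the allowed set
      conflicts - dict mapping normalized text to the sorted labels it received
    """
    invalid = []
    labels_by_text = defaultdict(set)
    for index, (text, label) in enumerate(records):
        normalized_label = label.strip().lower()
        if normalized_label not in allowed:
            invalid.append((index, label))
        # Normalize case and whitespace so trivially different copies match.
        normalized_text = " ".join(text.lower().split())
        labels_by_text[normalized_text].add(normalized_label)
    conflicts = {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }
    return invalid, conflicts


# Hypothetical sample records, for illustration only.
records = [
    ("Great product", "positive"),
    ("great  product", "negative"),   # same input, different label
    ("Okay", "neutral"),
    ("Bad service", "negatve"),       # misspelled label
]

invalid, conflicts = find_label_issues(records)
print(invalid)
print(conflicts)
```

A script like this only narrows down where to look. Deciding which of two conflicting labels is correct still requires a human reviewer and clear class definitions.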

4. Quality Assessment

• Applied quality metrics to assess dataset integrity:
  o Accuracy (correctness of labels).
  o Completeness (presence of all required data).
  o Consistency (uniform formats across fields).
• Created quality score sheets and dashboards for reporting.
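
One possible way to turn the three metrics above into numbers is sketched below in plain Python. The definitions are assumptions chosen for illustration: completeness as the share of required cells that are filled, consistency as the share of non-empty values matching an expected pattern, and accuracy as agreement between assigned labels and a manually reviewed sample. The field names and sample rows are hypothetical.

```python
import re

MISSING = (None, "")


def completeness(rows, required_fields):
    """Share of required cells that are filled in."""
    total = len(rows) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for row in rows for field in required_fields
        if row.get(field) not in MISSING
    )
    return filled / total


def consistency(rows, field, pattern):
    """Share of non-empty values in `field` that fully match `pattern`."""
    values = [row[field] for row in rows if row.get(field) not in MISSING]
    if not values:
        return 1.0
    regex = re.compile(pattern)
    return sum(1 for value in values if regex.fullmatch(value)) / len(values)


def label_accuracy(assigned, reviewed):
    """Share of assigned labels that agree with manually reviewed labels."""
    if len(assigned) != len(reviewed):
        raise ValueError("assigned and reviewed must have the same length")
    if not assigned:
        return 1.0
    return sum(a == r for a, r in zip(assigned, reviewed)) / len(assigned)


# Hypothetical sample rows, for illustration only.
rows = [
    {"id": "1", "date": "2024-01-05", "label": "cat"},
    {"id": "2", "date": "05/01/2024", "label": "dog"},
    {"id": "3", "date": "", "label": "cat"},
    {"id": "4", "date": "2024-02-11", "label": None},
]

completeness_score = completeness(rows, ("id", "date", "label"))
consistency_score = consistency(rows, "date", r"\d{4}-\d{2}-\d{2}")
accuracy_score = label_accuracy(["cat", "dog", "cat"], ["cat", "dog", "dog"])

print(f"completeness={completeness_score:.2f}")
print(f"consistency={consistency_score:.2f}")
print(f"accuracy={accuracy_score:.2f}")
```

Scores computed this way are only comparable across datasets if the required fields, patterns, and review sample are fixed in advance, which is one reason to document them alongside the score sheet.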

5. Automation and Scripting

• Developed simple automation scripts to:
  o Perform recurring cleaning tasks.
  o Validate data formats automatically.
  o Generate quality reports in a consistent format.
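
A script of the kind described above might look like the following sketch, which validates CSV fields against format rules and emits a report with a fixed structure. It uses only the Python standard library. The rule set, the column names, the deliberately loose email pattern, and the JSON report layout are assumptions for illustration, not the scripts actually written during the internship.

```python
import csv
import io
import json
import re

# Hypothetical format rules: column name -> pattern the whole value must match.
RULES = {
    "record_id": re.compile(r"\d+"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),  # loose check only
    "signup_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def validate_csv(text, rules=RULES):
    """Validate each CSV row against the rules and return a report dict."""
    reader = csv.DictReader(io.StringIO(text))
    errors = []
    rows_checked = 0
    # Line numbers assume one record per line, with the header on line 1.
    for line_number, row in enumerate(reader, start=2):
        rows_checked += 1
        for field, regex in rules.items():
            value = (row.get(field) or "").strip()
            if not regex.fullmatch(value):
                errors.append(
                    {"line": line_number, "field": field, "value": value}
                )
    return {
        "rows_checked": rows_checked,
        "error_count": len(errors),
        "errors": errors,
    }


# Hypothetical sample input, for illustration only.
SAMPLE = (
    "record_id,email,signup_date\n"
    "1,a@example.com,2024-01-05\n"
    "2,not-an-email,2024-01-06\n"
    "x3,c@example.com,06/01/2024\n"
)

report = validate_csv(SAMPLE)
print(json.dumps(report, indent=2))
```

Because the report always has the same keys, results from different datasets or different runs can be compared or loaded into a dashboard without extra parsing.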

6. Feedback and Refinement

• Collaborated with model development teams to understand the impact of data issues.
• Incorporated feedback to refine cleaning and validation techniques.
• Documented findings and improvements for future use.



CHAPTER 6

SYSTEM SPECIFICATION

An AI Data Quality Analyst plays a crucial role in ensuring clean, structured, and
accurate data for AI models. Here are some key system specifications and skills required for
this role:

Essential Skills

• Data Management: Proficiency in handling large datasets.
• Data Cleaning and Preprocessing: Techniques to identify and remove errors.
• Machine Learning Basics: Understanding how data impacts model training.
• Programming Skills: Knowledge of Python, R, and SQL.
• Attention to Detail: Critical for identifying data anomalies.

Tools and Platforms

• ETL Tools: Extract, Transform, Load tools like Informatica and Talend.
• Data Validation Tools: Great Expectations, Apache Griffin.
• Data Wrangling Tools: OpenRefine, Pandas.
• Cloud Platforms: AWS, Google Cloud, Microsoft Azure.
• Collaboration Tools: GitHub, JIRA.

Technologies

1. Machine Learning (ML): Algorithms for detecting patterns, anomalies, and automating data
error rectification.

2. Natural Language Processing (NLP): Processes unstructured data like text documents and
social media posts.

3. AI-Powered Automation: Automates data cleansing, enrichment, and real-time assessments.

4. Data Observability Tools: Platforms like Monte Carlo for monitoring data pipelines.

5. Cloud Platforms: AWS, Google Cloud, and Microsoft Azure for scalable data storage and
processing.

6. Data Validation Tools: Tools like Great Expectations and Apache Griffin for ensuring data
accuracy.



CHAPTER 7

RESULTS

The internship at NSDC in collaboration with Rooman Technologies resulted in significant
learning outcomes and tangible contributions to ongoing AI projects. By actively engaging in
data quality processes, I was able to directly impact the reliability and effectiveness of machine
learning models being developed.



CONCLUSION
The AI Data Quality Analyst program, delivered in collaboration with NSDC (National Skill
Development Corporation) and Rooman Technologies, has provided a comprehensive and industry-
aligned training experience. This initiative has empowered learners with essential skills in data
management, data quality assessment, and AI integration—key competencies in today’s data-driven
landscape.

Through hands-on sessions, real-world projects, and mentorship by experts, participants have gained
practical expertise in ensuring data integrity, handling large datasets, and applying analytical tools and
techniques. The training emphasized not only technical proficiency but also the importance of ethical
data handling and decision-making.

By bridging the gap between theoretical knowledge and practical application, this program contributes
to the national vision of creating a future-ready workforce skilled in emerging technologies. The
collaboration with Rooman Technologies and NSDC ensures that learners are not only employable
but also capable of driving innovation in the AI and data sectors.



REFERENCES
• AI-Data Analyst - Rooman Technologies
• AI - Data Quality Analyst v2 | National Skill Development Corporation (NSDC)
• https://www.montecarlodata.com/
• https://openai.com/chatgpt/overview
• https://copilot.microsoft.com/
