Data Engineering

The internship report details Shafina M's experience in data engineering at Amagine Edu Tech, focusing on designing and managing data pipelines using tools like Oracle SQL, Talend, and Power BI. Key responsibilities included developing ETL processes, ensuring data quality, and creating dynamic dashboards for data visualization. The internship provided practical exposure to data workflows and enhanced skills in database management and business intelligence tools.


VISVESVARAYA TECHNOLOGICAL UNIVERSITY

BELAGAVI, KARNATAKA-590014

INTERNSHIP REPORT

On
Data Engineering
at
Amagine Edu Tech Private Limited,
National Skill Development Corporation
(Government of India)

Submitted by

SHAFINA M
[USN:4DM21CS046]

In partial fulfilment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING
In
COMPUTER SCIENCE AND ENGINEERING

Carried out at: Yenepoya Institute of Technology.


Duration: October 18, 2024 to December 20, 2024

YENEPOYA INSTITUTE OF TECHNOLOGY

N.H. 13, THODAR, MOODBIDRI-574225, MANGALORE, D.K


2024-25
YENEPOYA INSTITUTE OF TECHNOLOGY
THODAR, MIJAR POST, MOODBIDRI-574225

(Affiliated to Visvesvaraya Technological University, Belagavi)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the internship entitled ‘Data Engineering’ was carried out by Shafina M,
USN: 4DM21CS046, of Computer Science & Engineering, during the academic year 2024-2025,
from October 18, 2024 to December 20, 2024 at Yenepoya Institute of Technology,
Moodbidri, Karnataka.

_______________            _______________              _______________

Internship Co-ordinator    Skill Development Officer    Head of Department
Dr. Sangamesh C.J.         Prof. Rohith A R             Dr. Sangamesh C.J.

TABLE OF CONTENTS

Acknowledgement
Declaration
Abstract
Introduction
  Data Engineering
  Data Visualization
Objectives
Technologies Used
  Oracle SQL
  Oracle SQL Developer
  Oracle SQL*Plus
  Talend Studio
  Microsoft Power BI
Tasks
  SQL Querying
  ETL Pipelining
  Data Visualization with Power BI
Conclusion
References

LIST OF FIGURES

Fig No.   Fig Name
4.1       Creation of Tables in SQL Developer
4.2       Insertion of Values
4.3       Database to Excel Migration
4.4       Output Excel File
4.5       Pie Chart Representation
4.6       Maps

Data Engineering

ACKNOWLEDGEMENT

The success and final outcome of this internship required a lot of guidance and assistance from many
people, and I am extremely privileged to have received it throughout the internship.
All that I have done is due to such supervision and assistance, and I will not forget to thank
them.

My deepest thanks to my internship co-ordinator, Dr. Sangamesh C. Jalade, H.O.D., Dept. of
Computer Science & Engineering, for his constant support and encouragement and for providing the
necessary advice and help. I am highly indebted to him for taking a keen interest in my work and for
monitoring and guiding it throughout the internship.

I sincerely express my gratitude to Dr. Sangamesh C. Jalade, H.O.D., Dept. of Computer Science
& Engineering for his constant support and guidance for the successful completion of this Internship
Report.

I express my sincere gratitude to Prof. Rohith A R, Skill Development Coordinator for his immense
support during the internship period.

I take immense pleasure in thanking our beloved Vice Principal Dr. Prabhakara B K for his
constant support.

I also thank Amagine Edu Tech Private Limited and the National Skill Development
Corporation for providing the internship.

I also thank our faculty members, who were always ready to help, whether with an off-hand
comment of encouragement or a constructive piece of criticism.

Last but not least, I want to thank my classmates and friends, who appreciated my work
and motivated me.

Shafina M
[4DM21CS046]

Dept of CSE, YIT, Moodbidri Page 1



DECLARATION

This is to certify that I have followed the guidelines provided by the University & Institute in
preparing this internship report, and that whenever I have used materials (data, theoretical analysis,
figures and text) from other sources, I have given due credit to them by citing them in the text of the
report and giving their details in the references.

Shafina M
[4DM21CS046]


ABSTRACT

The data engineering internship provided an in-depth understanding of designing and managing data
pipelines, focusing on transforming raw data into actionable insights. During the internship, I
worked on critical aspects of data integration, transformation, and visualization using tools like
Oracle SQL, Talend, and Power BI. These tools enabled the automation of ETL (Extract, Transform,
Load) processes and efficient management of structured datasets to support analytical tasks.

Key responsibilities included developing data pipelines to migrate data from Excel to an Oracle SQL
database, ensuring data accuracy, consistency, and optimization. Using Talend components such as
tMap, tFilterRow, and tDBOutput, I automated workflows, applied transformations, and improved
data quality. This streamlined approach facilitated seamless integration and enhanced data usability.

To further enhance decision-making, I designed dynamic Power BI dashboards to visualize key
metrics and trends. These interactive dashboards offered stakeholders comprehensive insights into
the dataset, demonstrating the impact of structured and well-managed data pipelines.

Overall, this internship was instrumental in strengthening my skills in data engineering, database
management, and business intelligence tools. It provided valuable experience in end-to-end data
workflows, enabling me to bridge the gap between raw data and meaningful insights for
decision-making processes.


CHAPTER 1

INTRODUCTION

1.1 Data Engineering

Data engineering is a critical discipline that focuses on the design, construction, and maintenance of
data infrastructure to support an organization’s analytics and decision-making processes. It involves
the creation of data pipelines, databases, and integration systems that enable the seamless flow and
transformation of raw data into structured, actionable formats. By leveraging tools like Talend,
Apache Spark, and cloud-based platforms such as AWS or Azure, data engineers ensure that data is
efficiently collected, processed, and stored for downstream applications. This field plays a pivotal
role in ensuring data quality, integrity, and accessibility, which are essential for enabling advanced
analytics, business intelligence, and machine learning models.

A data engineer's work revolves around scalability, reliability, and optimization of data systems to
meet an organization's growing needs. They collaborate closely with data analysts and scientists to
understand business requirements and create tailored solutions, ensuring that data systems are
aligned with organizational goals. By integrating various data sources and automating workflows,
data engineers enable businesses to unlock the potential of their data and make data-driven
decisions. This field not only bridges the gap between raw data and actionable insights but also lays
the foundation for innovation and operational efficiency in today's data-centric world.

1.2 Data Visualization

Data visualization is a powerful technique that transforms raw data into graphical representations,
enabling easier understanding and interpretation of complex information. It uses visual elements such
as charts, graphs, maps, and dashboards to present data in a way that highlights trends, patterns, and
insights. The primary goal of data visualization is to simplify the decision-making process by making
data more accessible and actionable for stakeholders. Tools like Power BI, Tableau, and Python
libraries such as Matplotlib and Seaborn are widely used to create interactive and visually appealing
representations of data. By presenting data visually, organizations can identify hidden insights and
make informed business decisions effectively.

In the modern era of data-driven decision-making, data visualization plays a crucial role in bridging the gap
between raw data and actionable insights. It allows businesses to monitor performance, analyze customer
behavior, track market trends, and evaluate key metrics at a glance. Interactive dashboards, for example,
enable real-time updates and drill-down capabilities, enhancing user engagement and providing deeper
insights. Furthermore, data visualization supports storytelling by connecting the dots between data points
and presenting information in a way that resonates with diverse audiences. As a result, it not only improves
understanding but also fosters a culture of data-driven strategies within organizations.


CHAPTER 2

OBJECTIVES

The key objectives are:

• Develop Data Pipelines: Learn to design, build, and optimize data pipelines for
seamless data extraction, transformation, and loading (ETL).

• Understand Data Integration: Gain hands-on experience in integrating data from
various sources into a unified system for efficient access and analysis.

• Enhance Data Quality: Implement techniques to ensure data accuracy, consistency,
and reliability throughout the data lifecycle.

• Work with Data Tools: Develop proficiency in tools like Talend, SQL, and cloud
platforms (AWS, Azure, etc.) for data processing and management.

• Database Management: Learn to design, query, and maintain relational and
non-relational databases for structured and unstructured data storage.

• Implement Scalability: Understand and implement scalable data infrastructure to
handle large volumes of data efficiently.

• Collaboration: Collaborate with teams, such as data scientists and analysts, to
understand business requirements and deliver tailored solutions.

• Automation: Explore techniques for automating data workflows to streamline
processes and enhance operational efficiency.

• Problem-Solving Skills: Strengthen problem-solving and troubleshooting skills by
addressing real-world data challenges.

• Practical Exposure: Apply theoretical knowledge to practical scenarios, enhancing
understanding of industry-specific data engineering applications.


CHAPTER 3

TECHNOLOGIES USED

3.1 Oracle SQL

Oracle SQL (Structured Query Language) is a powerful and widely used database language provided
by Oracle Corporation. It is designed for managing and querying relational databases, enabling users
to store, retrieve, update, and manipulate structured data efficiently. Oracle SQL supports a wide
range of features, including advanced data types, indexing, and robust transaction management,
making it a preferred choice for enterprise-scale database applications. With its rich set of SQL
commands, Oracle SQL simplifies tasks such as creating tables, defining relationships, and
performing complex joins and aggregations. It is widely used in industries to manage business-
critical data securely and reliably.

Oracle SQL also supports advanced functionalities like partitioning, materialized views, and query
optimization, which enhance performance and scalability for large datasets. Its integration with
Oracle's ecosystem, such as Oracle Database and Oracle Analytics, allows seamless data handling
and analysis. Whether it’s for application development, business intelligence, or data warehousing,
Oracle SQL serves as a foundational tool that enables organizations to derive meaningful insights
from their data while maintaining high standards of data integrity and security.

3.2 Oracle SQL Developer

Oracle SQL Developer is a graphical user interface (GUI) tool developed by Oracle Corporation
for working with Oracle databases. It provides an intuitive platform for managing database objects,
running SQL queries, and developing PL/SQL programs. SQL Developer simplifies database tasks
by offering features like query execution, debugging, and data modeling in a user-friendly
environment. This tool is particularly beneficial for database administrators (DBAs) and developers
as it eliminates the need for command-line interactions and streamlines workflows.

One of the standout features of Oracle SQL Developer is its ability to connect to multiple databases
simultaneously, supporting Oracle and non-Oracle databases such as MySQL and SQL Server. It
also includes features like SQL Worksheet for writing and testing SQL scripts, integrated version
control, and export/import utilities for data migration. By offering a robust yet user-friendly
interface, Oracle SQL Developer enhances productivity and simplifies database development and
management tasks, making it an indispensable tool for professionals working with Oracle databases.

A free graphical user interface, Oracle SQL Developer allows database users and administrators to
do their database tasks in fewer clicks and keystrokes. A productivity tool, SQL Developer's main
objective is to help the end user save time and maximize the return on investment in the Oracle
Database technology stack.

SQL Developer supports Oracle Database 10g, 11g, and 12c and will run on any operating system
that supports Java.

3.3 Oracle SQL*Plus

Oracle SQL*Plus is a command-line interface used to interact with Oracle databases. It allows users
to execute SQL commands and PL/SQL blocks directly, making it a versatile tool for database
administrators and developers. SQL*Plus is lightweight, easy to use, and provides immediate
feedback for executed queries, which is particularly useful for troubleshooting and debugging. With
its scripting capabilities, SQL*Plus allows users to automate repetitive tasks by writing and
executing SQL scripts.

Despite being a basic tool, SQL*Plus offers powerful features such as the ability to format query
results, generate reports, and create dynamic scripts. It is often used in scenarios where a graphical
user interface is unnecessary or unavailable, such as on remote servers. SQL*Plus also integrates
seamlessly with Oracle's other tools and technologies, making it an essential component of the
Oracle database ecosystem.

SQL*Plus has its own commands and environment, and it provides access to the Oracle Database.

It enables you to enter and execute SQL, PL/SQL, SQL*Plus and operating system commands to
perform the following:

• Format, perform calculations on, store, and print from query results

• Examine table and object definitions

• Develop and run batch scripts

• Perform database administration

3.4 Talend Studio

Talend Studio is a comprehensive data integration tool that simplifies the process of extracting,
transforming, and loading (ETL) data from various sources. It provides an intuitive drag-and-drop
interface that allows users to design and automate complex data workflows without extensive
coding. Talend Studio supports integration with diverse data sources, including databases, cloud
services, and APIs, making it a versatile solution for data migration, cleansing, and transformation
tasks.

One of Talend Studio’s strengths lies in its rich library of pre-built components, such as tMap,
tFilterRow, and tUniqRow, which enable efficient data manipulation and customization.
Additionally, Talend supports big data processing and real-time data integration, making it suitable
for modern data engineering challenges. Its ability to generate Java code for workflows enhances
flexibility and scalability. Talend Studio empowers organizations to manage data pipelines
effectively, ensuring high-quality data for analytics and business intelligence.
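
The behaviour of the components named above can be approximated in a few lines of Python. This is only an illustrative sketch, not Talend-generated code: the function names `filter_rows`, `unique_rows`, and `map_rows` and the sample passenger rows are hypothetical, and the real tMap component also supports joins and lookups that are omitted here.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def filter_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only rows that satisfy the predicate (roughly the role of tFilterRow)."""
    return [row for row in rows if predicate(row)]


def unique_rows(rows: Iterable[Row], key_columns: List[str]) -> List[Row]:
    """Keep the first row seen for each key (roughly the role of tUniqRow)."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


def map_rows(rows: Iterable[Row], mapping: Dict[str, Callable[[Row], Any]]) -> List[Row]:
    """Build output columns from expressions over each input row (a simplified tMap)."""
    return [{out: expr(row) for out, expr in mapping.items()} for row in rows]


# Hypothetical sample data: one invalid age and one duplicate id.
passengers = [
    {"id": 1, "name": " asha ", "age": 34},
    {"id": 2, "name": "Ravi", "age": -1},
    {"id": 1, "name": " asha ", "age": 34},
]

valid = filter_rows(passengers, lambda r: r["age"] >= 0)
deduped = unique_rows(valid, ["id"])
cleaned = map_rows(
    deduped,
    {
        "passenger_id": lambda r: r["id"],
        "full_name": lambda r: r["name"].strip().title(),
    },
)
print(cleaned)
```

Chaining the three steps mirrors how components are wired together in a Talend job: each step takes the previous step's rows and passes on a cleaner set.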

Talend Studio provides a range of SQL templates to simplify the most common data query and
update, schema creation and modification, and data access control tasks. It also comprises a SQL
editor which allows you to customize or design your own SQL templates to meet less common
requirements.


3.5 Microsoft Power BI

Microsoft Power BI is a business analytics tool that enables users to visualize and analyze data
through interactive dashboards and reports. It provides an intuitive interface for transforming raw
data into actionable insights, helping businesses monitor performance and make informed decisions.
Power BI supports seamless integration with various data sources, including Excel, SQL databases,
cloud platforms, and APIs, allowing users to consolidate data from multiple systems into a single,
unified view.

One of the key features of Power BI is its ability to create dynamic, real-time dashboards that offer
drill-down capabilities and user interaction. It also includes advanced analytics features, such as
DAX (Data Analysis Expressions) for custom calculations and AI-powered insights for predictive
analysis. Power BI’s cloud-based service allows users to share reports and collaborate across teams
effortlessly. With its accessibility, scalability, and user-friendly design, Power BI has become a vital
tool for businesses aiming to adopt data-driven decision-making.

One common workflow in Power BI begins by connecting to data sources in Power BI Desktop and
building a report. You then publish that report from Power BI Desktop to the Power BI service, and
share it so business users in the Power BI service and on mobile devices can view and interact with
the report.


CHAPTER 4

TASKS

4.1 SQL Querying

The Railway Management System Project is a comprehensive data solution aimed at enhancing the efficiency,
decision-making, and overall performance of railway operations. In this project, a robust data pipeline was
established using SQL for database management, Talend for data integration and transformation, and Power
BI for visualization and reporting. Each of these technologies was specifically chosen to address the unique
needs within the railway industry, including managing train schedules, tracking passenger journeys,
processing ticketing information, and analyzing operational performance.

At the heart of the project is the SQL database, which was carefully designed to handle large volumes of
transactional data with high efficiency. The SQL development process focused on structuring the database to
support various types of queries, such as retrieving train schedules, calculating revenue from ticket sales,
maintaining passenger records, and tracking train locations. Advanced SQL queries and techniques were
implemented to ensure optimized data retrieval, maintaining the integrity and accuracy of mission-critical
data, which is essential for daily operations and informed decision-making. Additionally, the database design
included features for real-time data updates, such as train delays or cancellations, to ensure seamless
communication across different departments and improve operational responsiveness.
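
As a rough illustration of the kind of schema and revenue query described above, the sketch below uses Python's built-in `sqlite3` module as a stand-in for Oracle. The table names (`trains`, `tickets`), columns, and sample rows are assumptions made for this example and are not taken from the project's actual database; Oracle data types and syntax differ in places (for example `NUMBER` and `VARCHAR2`).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Oracle database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: each ticket references the train it was sold for.
conn.executescript("""
CREATE TABLE trains (
    train_id    INTEGER PRIMARY KEY,
    train_name  TEXT NOT NULL,
    source      TEXT NOT NULL,
    destination TEXT NOT NULL
);
CREATE TABLE tickets (
    ticket_id      INTEGER PRIMARY KEY,
    train_id       INTEGER NOT NULL REFERENCES trains(train_id),
    passenger_name TEXT NOT NULL,
    fare           REAL NOT NULL CHECK (fare >= 0)
);
""")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO trains VALUES (?, ?, ?, ?)",
    [
        (101, "Express A", "Station X", "Station Y"),
        (102, "Express B", "Station X", "Station Z"),
    ],
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?, ?)",
    [
        (1, 101, "Asha", 450.0),
        (2, 101, "Ravi", 450.0),
        (3, 102, "Meera", 300.0),
    ],
)

# Revenue per train: join tickets to trains, then aggregate.
revenue = conn.execute("""
    SELECT t.train_name,
           COUNT(k.ticket_id) AS tickets_sold,
           SUM(k.fare)        AS revenue
    FROM trains t
    JOIN tickets k ON k.train_id = t.train_id
    GROUP BY t.train_id, t.train_name
    ORDER BY revenue DESC
""").fetchall()
print(revenue)
```

The primary key, foreign key, and `CHECK` constraint are what enforce the integrity mentioned above: a ticket cannot reference a train that does not exist, and a negative fare is rejected at insert time.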


Fig 4.1: Creation of Tables in SQL Developer


Fig 4.2: Insertion of Values


4.2 ETL Pipelining

The ETL (Extract, Transform, Load) process, powered by Talend, played a key role in ensuring smooth data
integration from multiple external sources into the centralized railway management database. Raw data from
Excel files, along with other formats, was processed using Talend’s suite of tools like tFileInputExcel, tMap,
and tDBOutput. This allowed for the seamless migration of data from different railway systems, ensuring
consistency across all platforms and making it easier to manage and analyze the data in one unified location.
This step was crucial for handling the diverse data structures typical of the railway industry, such as passenger
bookings, train schedules, maintenance logs, and ticketing transactions.

In the context of railway management systems, Talend can be used to extract data from diverse sources like
ticket reservation systems, train tracking platforms, and maintenance databases. The data can then be cleaned,
transformed, and loaded into centralized systems or analytical platforms. Talend's components, such as tMap
and tFilterRow, allow railway operators to ensure data consistency and prepare datasets for deeper analysis. This
integration capability makes Talend indispensable for railways that handle large volumes of data across
various channels.
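
The extract, transform, and load stages described above can be sketched in plain Python. This is not the Talend job itself: delimited text stands in for the Excel input, `sqlite3` stands in for the Oracle target, and the column names, sample rows, and the rule of rejecting rows with a missing fare are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the second data row has a missing fare.
RAW_EXPORT = """ticket_id,train_id,passenger_name,fare
1,101, asha ,450
2,101,Ravi,
3,102,Meera,300
"""


def extract(text):
    """Extract: read raw rows as dictionaries (stand-in for the Excel input step)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim names, convert types, and set aside rows with no fare."""
    clean, rejected = [], []
    for row in rows:
        fare = row["fare"].strip()
        if not fare:
            rejected.append(row)
            continue
        clean.append((
            int(row["ticket_id"]),
            int(row["train_id"]),
            row["passenger_name"].strip().title(),
            float(fare),
        ))
    return clean, rejected


def load(conn, records):
    """Load: insert the cleaned records in one transaction and return the row count."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tickets (
            ticket_id      INTEGER PRIMARY KEY,
            train_id       INTEGER NOT NULL,
            passenger_name TEXT NOT NULL,
            fare           REAL NOT NULL
        )
    """)
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO tickets VALUES (?, ?, ?, ?)", records)
    return conn.execute("SELECT COUNT(*) FROM tickets").fetchone()[0]


conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract(RAW_EXPORT))
loaded = load(conn, clean)
print(f"loaded={loaded} rejected={len(rejected)}")
```

Keeping rejected rows in a separate list, rather than silently dropping them, makes it possible to report or correct bad input later, which is one common way to support the data-consistency goal mentioned above.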

Fig 4.3: Database to Excel Migration


Fig 4.4: Output Excel File

4.3 Data Visualization with Power BI

Power BI adds a layer of visualization and analytics, making it a vital component of modern railway
management systems. With Power BI, railway operators can create interactive dashboards and
reports that showcase key performance metrics, such as passenger flow, train punctuality, and
maintenance schedules. Its ability to integrate with SQL databases and Talend workflows ensures a
seamless data pipeline, providing real-time insights. For example, a Power BI dashboard could
visualize train occupancy rates against ticket sales trends, enabling better route optimization and
scheduling. This analytical edge supports data-driven decision-making, helping railway systems
operate efficiently in a dynamic environment.

Power BI completes the system by providing advanced analytics and visualization capabilities.
Interactive dashboards and reports were developed to give railway administrators actionable
insights into ticket sales trends, passenger demographics, train schedules, and maintenance activities.
These visualizations help decision-makers identify areas for improvement, track key
performance indicators (KPIs), and optimize resource allocation for enhanced operational efficiency.
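
To make the KPI idea concrete, the sketch below computes two example measures, occupancy and on-time percentage, per train. In Power BI such measures would normally be defined in the report itself; this Python version only illustrates the arithmetic. The journey records, field names, and the five-minute on-time threshold are all assumptions for this example.

```python
from collections import defaultdict

# Hypothetical journey records.
journeys = [
    {"train": "Express A", "seats": 200, "tickets_sold": 150, "delay_min": 0},
    {"train": "Express A", "seats": 200, "tickets_sold": 190, "delay_min": 12},
    {"train": "Express B", "seats": 100, "tickets_sold": 40, "delay_min": 3},
]

# Assumed rule: a trip counts as on time if it is at most 5 minutes late.
ON_TIME_THRESHOLD_MIN = 5


def summarize(rows):
    """Aggregate per train, then derive occupancy and on-time percentages."""
    totals = defaultdict(lambda: {"seats": 0, "sold": 0, "trips": 0, "on_time": 0})
    for row in rows:
        t = totals[row["train"]]
        t["seats"] += row["seats"]
        t["sold"] += row["tickets_sold"]
        t["trips"] += 1
        if row["delay_min"] <= ON_TIME_THRESHOLD_MIN:
            t["on_time"] += 1
    return {
        train: {
            "occupancy_pct": round(100 * t["sold"] / t["seats"], 1),
            "on_time_pct": round(100 * t["on_time"] / t["trips"], 1),
        }
        for train, t in totals.items()
    }


kpis = summarize(journeys)
for train, values in kpis.items():
    print(train, values)
```

Occupancy is computed from summed seats and tickets rather than by averaging per-trip percentages, so trips with more seats carry proportionally more weight.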


Fig 4.5: Pie Chart Representation

Fig 4.6: Maps


CHAPTER 5

CONCLUSION

The data engineering internship was a transformative experience that allowed me to gain practical
knowledge in managing and integrating data effectively. I worked on projects that involved
streamlining data migration processes using tools like Talend, where I mastered components such as
tFileInputExcel, tMap, and tDBOutput to ensure seamless data transfer. Additionally, I enhanced
my skills in data visualization by creating insightful Power BI reports, turning raw data into
actionable insights. These hands-on experiences not only strengthened my technical expertise but
also improved my problem-solving and analytical abilities. This internship laid a solid foundation for
my career in data engineering, equipping me with the skills to handle real-world data challenges
efficiently.


REFERENCES

[1] Talend Documentation: https://help.talend.com/

[2] Power BI Documentation: https://docs.microsoft.com/en-us/power-bi/

[3] "Railway Management: A Case Study Approach" by John S. Beck, Delmar Cengage Learning.

[4] Qlik Sense: https://www.qlik.com

[5] Oracle Downloads: https://www.oracle.com
