Altair Data Science Internship
AN INTERNSHIP REPORT
Submitted by
P. YASO KRISHNA
Register No: 21K61A0648
BACHELOR OF TECHNOLOGY
In
NOVEMBER 2024
Certificate
BONAFIDE CERTIFICATE
I hereby declare that the internship report titled "Altair Data Science Virtual Internship" is a record of
original work completed by me during my internship period at Sasi Institute of Technology &
Engineering from April to June. The report is based on my experience and the insights gained
from completing the internship course.
I confirm that this report has not been submitted previously in any form to any other university,
institute, or publication for any purpose. I have also ensured that all information and
observations are accurately represented to the best of my knowledge, and any reference to data,
sources, or colleagues' work is duly acknowledged.
DECLARATION
P. YASO KRISHNA
21K61A0648
ABSTRACT
ACKNOWLEDGEMENT
First and foremost, I would like to thank the Lord Almighty for giving me the strength,
knowledge, ability, and opportunity to undertake this internship and to persevere and
complete it satisfactorily.
I would like to express my heartiest gratitude and thanks to my supervisor, Dr. K. Uma,
Associate Professor, Department of Computer Science & Technology, Sasi Institute of
Technology, Tadepalligudem, for her valuable guidance, continuous support, and
encouragement at all stages of this internship. Its successful and timely completion was
possible only through her inspiration and constructive comments.
I owe my gratitude to Dr. Mohamad Ismail, Principal, Sasi Institute of Technology,
Tadepalligudem, for his sustained support. I express my heartfelt thanks to Dr. P. Kiran Kumar,
Professor & Head, Department of Computer Science & Technology, Sasi Institute of
Technology, Tadepalligudem, for his kind support in pursuing this internship.
P. YASO KRISHNA
Vision & Mission
PEOs and PSOs
Program Outcomes
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex
engineering problems.
2. Problem analysis: Identify, formulate, research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety, and the cultural,
societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data,
and synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex
engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member
or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary
environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
List of Contents
2.4.2. Consumer
2.4.3. Directed Acyclic Graph Scheduling
2.5. Functional Layer
2.6. Data Science Process
List of Figures
Fig 2: Data Vault
Fig 6: Analysis of Data
Fig 7: Categories of Data
Fig 8: Time-Person-Object-Location-Event high-level design
Fig 9: Time Link
CHAPTER 1
Data Science Technology Stack
1.1. Rapid Information Factory (RIF) Ecosystem:
The Rapid Information Factory (RIF) system is a technique and tool set used for
processing data during development. The Rapid Information Factory is a massively parallel
data processing platform capable of processing data sets of theoretically unlimited size.
The Rapid Information Factory (RIF) platform supports five high-level layers:
Functional Layer:
The functional layer is the core processing capability of the factory.
Core functional data processing methodology is the R-A-P-T-O-R framework.
1. Maintenance Utilities
2. Data Utilities
3. Processing Utilities
Business Layer:
Contains the business requirements (functional and non-functional).
There are two basic data processing tools to perform practical data science, as
given below:
1. A schema-on-read ecosystem does not need a predefined schema; without one, you
can still load the data into the database.
2. It has the capability to store structured, semi-structured, and unstructured data, and
it applies most of its flexibility when we request a query during execution.
This is the place where you can store the three types of data (structured, semi-structured,
and unstructured) with no fixed limit on the amount of data or storage.
The data lake keeps little data in structured databases, because it follows the
schema-on-read process architecture to store the data.
The data lake allows us to transform the raw data, whether structured, semi-structured,
or unstructured, into a structured data format so that SQL queries can be performed
for analysis, as the sketch below illustrates.
A data lake is similar to a real river or lake: the water comes from many different
places, and in the end all the small rivers and streams merge into one big river or lake
where a large amount of water is stored, available to anyone who needs it.
Fig1: Representation of data lake
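To make the schema-on-read idea concrete, here is a minimal PySpark sketch. It is illustrative only: the file name events.json and the column user_id are assumptions, not artifacts from the internship itself.

```python
# Minimal schema-on-read sketch with PySpark (file and column names assumed).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# No schema is declared when the raw JSON lands in the lake;
# Spark infers the structure only at read time.
raw = spark.read.json("events.json")
raw.printSchema()

# The structure is applied at query time, so the same raw data
# can serve different questions through plain SQL.
raw.createOrReplaceTempView("events")
spark.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()
```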
The data lake and data vault are built using three main components or structures of
data, i.e., the hub, link, and satellite.
1.4.1. Hub:
A hub holds a unique business key, with a low rate of change, together with
metadata describing the source that generated the hub.
1.4.2. Link:
A link represents the relationship between two or more business keys, joining
the related hubs together.
1.4.3. Satellites:
Hubs and links form the structure of the model, but they store no chronological
data, so by themselves they would not provide information such as the mean,
median, mode, maximum, minimum, or sum of the data.
Satellites are the structures that store the detailed information about the related
data or business characteristic keys, and they hold the large data volumes of the
vault.
The combination of all three (hub, link, and satellites) helps data analysts, data
scientists, and data engineers store the business structure and its types of
information or data.
1.5. Data Science Processing Tools:
Most data scientists, data analysts, and data engineers use the following data science
processing tools to process the data vault and transfer it into a data warehouse.
1.5.1. Spark:
1. Apache Spark has the capability and potential to process all types and varieties
of data against repositories including the Hadoop Distributed File System and
NoSQL databases.
2. Using Spark Core, you can write more complex queries that help you work in
complex environments.
3. The distributed nature of the Spark ecosystem enables you to take the same
processing from a small cluster to hundreds or thousands of nodes without
making any changes.
4. Apache Spark Core comes with advanced analytics: it supports not only map
and reduce but also SQL queries, machine learning, and graph algorithms.
Fig4: Apache Spark Core Representation
1.5.2. Spark SQL:
1. Spark SQL is a fast, clustered data abstraction, so data manipulation can
be done with fast computation.
2. It enables the user to run SQL/HQL on top of Spark, and by using this, we
can process structured, semi-structured, and unstructured data.
1.5.3. Spark Streaming:
1. Spark Streaming divides the incoming input data into small units for
further data analytics and data processing at the next level, as the sketch
below shows.
2. There are multiple levels of processing involved: live streaming data is
received and divided into small parts or batches, and these small batches
are then processed by the Spark engine to generate the final stream of data.
3. Data processing in Hadoop has very high latency, meaning data is not
received in a timely manner, so it is not suitable for real-time processing
requirements.
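A minimal Structured Streaming sketch of this micro-batch behaviour follows. It assumes a text source on localhost:9999 (for example, started with `nc -lk 9999`); that source is an assumption for the demo, not part of the original report.

```python
# Micro-batch word count with Spark Structured Streaming
# (a socket source on localhost:9999 is assumed for the demo).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Incoming lines are cut into small batches by the engine.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Each micro-batch is processed with ordinary DataFrame operations.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# The running result is emitted after every micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```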
1.5.5. GraphX:
1. GraphX is a very powerful graph-processing application programming
interface for the Apache Spark analytics engine.
2. GraphX combines the ETL process (extract, transform, and load),
exploratory analysis, and iterative graph computation within a single system.
3. Its usage can be seen in Facebook and LinkedIn connection graphs, Google
Maps, and internet routers, which use these kinds of tools for better response
and analysis; a small stand-in sketch follows.
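GraphX itself is exposed through Spark's Scala and Java APIs, so as a hedged Python stand-in, the sketch below shows the same idea of iterative graph computation (PageRank) with NetworkX on a tiny made-up connection graph.

```python
# Iterative graph computation (PageRank) on a toy directed graph.
# NetworkX stands in for GraphX here; the nodes are invented examples.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("alice", "bob"), ("bob", "carol"),
    ("carol", "alice"), ("dave", "carol"),
])

# PageRank iterates over the graph until the scores converge.
scores = nx.pagerank(g, alpha=0.85)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```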
1.6.1. Elastic Search:
Elastic Search is used as a replacement for document and data stores such
as MongoDB.
Elastic Search is one of the most popular search engines and is used by
well-known organizations such as Stack Overflow and GitHub, among many
others; a usage sketch is given below.
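The sketch uses the official Python client and assumes a local Elasticsearch node on port 9200; the index name and documents are invented for illustration.

```python
# Index a document and search it back (local node and index name assumed).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Store a document, much as you would in a document database.
es.index(index="articles", id=1,
         document={"title": "Data lakes", "body": "Schema on read explained"})
es.indices.refresh(index="articles")

# Full-text search over the indexed documents.
resp = es.search(index="articles", query={"match": {"body": "schema"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```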
1.6.2. R Language:
1.6.3. Scala:
Scala is a general-purpose programming language that supports
functional programming and a strong static type system.
Many data science projects and frameworks are built using the Scala
programming language because it has so many capabilities and so much
potential to work with.
Scala integrates the features of object-oriented and functional languages,
and because it runs on the JVM it interoperates closely with Java.
1.6.4. Python:
Python is a programming language that can be used on a server to create
web applications.
Python can handle large amounts of data and is capable of performing
complex tasks on that data.
CHAPTER 2
Three Management Layers
2.1. Introduction:
The Three Management Layers are a very important part of the framework. They
watch the overall operations in the data science ecosystem and make sure that things are
happening as per plan.
This layer is the center of complete processing capability in the data science
ecosystem.
This layer stores what you want to process, along with every processing schedule
and workflow for the entire ecosystem.
5. Overall Communication
6. Overall Alerting
2.3. Audit, Balance, and Control:
2.3.1. Audit:
An audit refers to an examination of the ecosystem that is systematic
and independent.
2.3.2. Balance:
The balance sublayer has the responsibility of making sure that the data
science environment is balanced between the available and the required
processing capability, or that it has the ability to upgrade processing
capability during periods of extreme processing.
2.3.3. Control:
The cause-and-effect analysis system is the core data source for the
distributed control system in the ecosystem.
2.4.1. Producer:
The producer is the part of the system that generates the requests for data
science processing, by creating structured messages for each type of data
science process it requires.
The producer is the endpoint of the pipeline that loads messages into Kafka;
a minimal sketch follows the consumer description below.
2.4.2. Consumer:
The consumer is the part of the process that takes in messages and organizes
them for processing by the data science tools.
The consumer is the end point of the pipeline that offloads the messages from
Kafka.
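The sketch below illustrates this producer/consumer split with the kafka-python package; the broker address localhost:9092 and the topic name ds-requests are assumptions for the demo, not values from the report.

```python
# Producer/consumer sketch with kafka-python (broker and topic assumed).
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: the endpoint that loads structured request messages into Kafka.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)
producer.send("ds-requests", {"process": "retrieve", "source": "crm"})
producer.flush()

# Consumer: the endpoint that offloads messages for the data science tools.
consumer = KafkaConsumer(
    "ds-requests",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # hand each request to the matching processing step
```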
2.4.3. Directed Acyclic Graph Scheduling:
This solution uses a combination of graph theory and publish-subscribe stream
data processing to enable scheduling.
You can use the Python NetworkX library to resolve any conflicts, by simply
formulating the graph to a specific point before or after you send or receive
messages via Kafka, as in the sketch below.
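As a sketch of this idea, the following NetworkX snippet models processing steps as a directed acyclic graph and resolves it to a concrete run order; the step names are illustrative.

```python
# DAG scheduling sketch: nodes are steps, edges mean "must run before".
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("retrieve", "assess"),
    ("assess", "process"),
    ("process", "transform"),
    ("transform", "report"),
])

# A schedule only exists if the graph stays acyclic.
assert nx.is_directed_acyclic_graph(dag)

# Resolve the graph to one valid order before dispatching work via Kafka.
for step in nx.topological_sort(dag):
    print("run:", step)
```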
The cause-and-effect analysis system is the part of the ecosystem that collects
all the logs, schedules, and other ecosystem-related information and enables
data scientists to evaluate the quality of their system.
2.5. Functional Layer:
The functional layer of the data science ecosystem is the largest and most
essential layer for programming and modeling.
CHAPTER 3
Retrieve Super Step
3.1. Introduction:
Verify the hypothesis using real-world evidence: we verify our hypothesis by
comparing it with real-world evidence.
The Retrieve super step is the first contact between your data science and the source
systems.
The successful retrieval of the data is a major stepping-stone to ensuring that you are
performing good data science.
Data lineage delivers the audit trail of the data elements at the lowest granular level,
to ensure full data governance.
The data lake is the complete data world your company interacts with during its
business life span.
In simple terms, if you generate data or consume data to perform your business tasks,
that data is in your company’s data lake.
Just as a lake needs rivers and streams to feed it, the data lake will consume an
unavoidable deluge of data sources from upstream and deliver it to downstream
partners.
Simply dumping a horde of data into a data lake, with no tangible purpose in
mind, will result in a big business risk.
The data lake must be enabled to collect the data required to answer your
business questions.
Data quality can cause the invalidation of a complete data set, if not dealt with
correctly.
Bad data simply accumulates into a worse problem if it is not managed.
People, process, and technology are the three cornerstones to ensure that data is
curated and protected.
You are responsible for your people; share the knowledge you acquire from this book.
The process I teach you, you need to teach them. Alone, you cannot achieve success.
3. Port - A port is any point from which you have to exit or enter a country.
Normally, these are shipping ports or airports but can also include border
crossings via road. Note that there are two ports in the complete process.
This is important: there is a port of exit and a port of entry.
4. Ship - Ship is the general term for the physical transport method used
for the goods. This can refer to a cargo ship, airplane, truck, or even a
person, but it must be identified by a unique allocation number.
Microsoft SQL Server - Microsoft SQL server is common in companies, and this
connector supports your connection to the database.
MySQL - MySQL is widely used by lots of companies for storing data. This opens
that data to your data science with the change of a simple connection string.
Microsoft Excel - Excel is common in the data sharing ecosystem, and it enables you
to load files using this format with ease. A short connection sketch for these three
sources follows.
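The sketch below shows one hedged way to read from these three sources with pandas and SQLAlchemy; the connection strings, table names, and file name are placeholders, not credentials or schemas from the report.

```python
# Reading from SQL Server, MySQL, and Excel with pandas (all names assumed).
import pandas as pd
from sqlalchemy import create_engine

# Microsoft SQL Server, via the pyodbc driver.
mssql = create_engine(
    "mssql+pyodbc://user:password@server/db"
    "?driver=ODBC+Driver+17+for+SQL+Server")
customers = pd.read_sql("SELECT * FROM customers", mssql)

# MySQL: only the connection string changes.
mysql = create_engine("mysql+pymysql://user:password@localhost/sales")
orders = pd.read_sql("SELECT * FROM orders", mysql)

# Microsoft Excel.
budgets = pd.read_excel("budgets.xlsx", sheet_name="2024")
```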
CHAPTER 4
Assess Super Step
4.1. Introduction:
Data quality problems result in a 20% decrease in worker productivity and explain
why 40% of business initiatives fail to achieve set goals. Incorrect data can harm a reputation,
misdirect resources, slow down the retrieval of information, and lead to false insights and
missed opportunities.
For example, if an organization has the incorrect name or mailing address of a
prospective client, their marketing materials could go to the wrong recipient. If sales data is
attributed to the wrong SKU or brand, the company might invest in a product line with less
than stellar customer demand.
Data profiling is the process of examining, analyzing and reviewing data to collect
statistics surrounding the quality and hygiene of the dataset. Data quality refers to the
accuracy, consistency, validity and completeness of data. Data profiling may also be known
as data archeology, data assessment, data discovery or data quality analysis.
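A minimal first profiling pass in pandas might look like the sketch below; the file name customers.csv is an assumed example.

```python
# First-pass data profiling with pandas (input file name assumed).
import pandas as pd

df = pd.read_csv("customers.csv")

print(df.dtypes)                   # are the column types what we expect?
print(df.describe(include="all"))  # per-column summary statistics
print(df.isnull().sum())           # completeness: missing values per column
print(df.duplicated().sum())       # how many exact duplicate rows?
```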
4.2. Errors:
Errors are the norm, not the exception, when working with data. By now, you’ve
probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot safely
assume that any of the data we work with is error-free, our mission should be to find and
tackle errors in the most efficient way possible.
Organizations can address this by embedding data quality techniques into their business
processes and into their enterprise applications and data integration.
4.3. Data Quality Dimensions:
4.3.1. Completeness:
Completeness is defined as expected comprehensiveness. Data can be
complete even if optional data is missing: as long as the data meets the expectations,
it is considered complete. For example, a customer's first name and last name are
mandatory, but the middle name is optional, so a record can be considered complete
even if a middle name is not available (see the sketch below).
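A hedged pandas sketch of that exact rule follows; the column names mirror the example above, but the records are invented.

```python
# Completeness check: first and last name mandatory, middle name optional.
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Asha", None],
    "middle_name": [None, "K."],
    "last_name": ["Rao", "Iyer"],
})

mandatory = ["first_name", "last_name"]
incomplete = df[df[mandatory].isnull().any(axis=1)]
print(incomplete)  # only rows missing a mandatory field fail the check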
4.3.2. Consistency:
Consistency means that data across all systems reflects the same information
and is in sync across the enterprise.
4.3.3. Conformity:
Conformity means the data follows the set of standard data definitions,
such as data type, size, and format. For example, the date of birth of a customer is
in the format "mm/dd/yyyy"; the sketch below flags values that violate this standard.
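The sketch below flags dates that violate the "mm/dd/yyyy" standard; the column name and values are illustrative.

```python
# Conformity check against the "mm/dd/yyyy" standard (values invented).
import pandas as pd

df = pd.DataFrame({"date_of_birth": ["01/31/1990", "1990-01-31", "02/29/2001"]})

# errors="coerce" turns non-conforming (or impossible) dates into NaT.
parsed = pd.to_datetime(df["date_of_birth"], format="%m/%d/%Y", errors="coerce")
print(df[parsed.isna()])  # wrong format and invalid dates are both flagged
```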
4.3.4. Accuracy:
Accuracy is the degree to which data correctly reflects the real-world object or
event being described. Examples: the sales figure of the business unit is the real value;
the address of an employee in the employee database is the real address.
CHAPTER 5
Process Super Step
5.1. Objectives:
The objective of this chapter is to learn the Time-Person-Object-Location-Event
(T-P-O-L-E) design principle and the various concepts that are used to create and define
relationships among this data.
5.2. Introduction:
The Process superstep converts the assessed results of the retrieved versions of the
data sources into a highly structured data vault. These data vaults form the basic data
structure for the rest of the data science steps.
Fig7: Categories of Data
5.3. Data Vault:
5.3.1. Hubs:
A data vault hub is used to store business keys. These keys do not change over
time. A hub also contains a surrogate key for each hub entry and metadata
information for the business key.
5.3.2. Links:
Data vault links are join relationships between business keys.
5.3.3. Satellites:
Data vault satellites store the chronological, descriptive characteristics for
a specific section of business data. Using hubs and links, we get the model
structure but no chronological characteristics. Satellites consist of characteristics
and metadata linking them to their specific hub.
5.3.4. Reference Satellites:
Reference satellites are referenced from satellites and can be used by other
satellites to prevent redundant storage of reference characteristics; a toy sketch
of the hub/link/satellite split follows.
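As an illustration of this split (all table and column names are assumptions, not the report's schema), the pandas sketch below builds one hub, one link, and one satellite.

```python
# Toy data vault: hub (stable keys), link (relationship), satellite (history).
import pandas as pd

# Hub: unchanging business keys plus surrogate keys and load metadata.
person_hub = pd.DataFrame({
    "person_sk": [1, 2],
    "person_business_key": ["P-1001", "P-1002"],
    "load_date": pd.to_datetime(["2024-04-01", "2024-04-01"]),
})

# Link: a join relationship between two hubs, by surrogate keys only.
person_location_link = pd.DataFrame({
    "person_sk": [1, 2],
    "location_sk": [10, 11],
    "load_date": pd.to_datetime(["2024-04-02", "2024-04-02"]),
})

# Satellite: chronological descriptive attributes for a hub entry;
# person 1 has two rows because the name changed over time.
person_satellite = pd.DataFrame({
    "person_sk": [1, 1, 2],
    "name": ["Asha", "Asha K. Rao", "Ravi"],
    "load_date": pd.to_datetime(["2024-04-01", "2024-05-01", "2024-04-01"]),
})
print(person_hub.merge(person_satellite, on="person_sk"))
```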
Fig8: Time-Person-Object-Location-Event high-level design
Fig9: Time Link
Following are the time links that can be stored as separate links.
Time-Person Link
Time-Object Link
Time-Location Link
Time-Event Link
A time satellite can be used to move from one time zone to another very easily.
This feature will be used during the Transform superstep; a small sketch follows.
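A small pandas sketch of that time-zone shift is below; the timestamps and zones are invented examples.

```python
# Moving hub times between zones, as a time satellite would allow.
import pandas as pd

utc_times = pd.to_datetime(
    ["2024-04-01 09:30", "2024-04-01 17:00"]).tz_localize("UTC")

# The same instants re-expressed for a local analysis (India Standard Time).
ist_times = utc_times.tz_convert("Asia/Kolkata")
print(ist_times)
```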
5.6. Person Section:
The person section contains the data structures that store all data related to a person.
5.6.1. Person Hub:
Following are the fields of Person hub.
Person links connect the person hub to other hubs.
Following are the person links that can be stored as separate links.
Person-Time Link
Person-Object Link
Person-Location Link
Person-Event Link
5.7. Object Section:
The object hub represents a real-world object with a few attributes. Following
are the fields of the object hub.
Following are the object links that can be stored as separate links.
Object-Time Link
Object-Person Link
Object-Location Link
Object-Event Link
5.8. Location Section:
The location section contains the data structures that store all data related to location.
The location hub consists of a series of fields that support a GPS location.
The location hub consists of the following fields:
Following are the location links that can be stored as separate links.
Location-Time Link
Location-Person Link
Location-Object Link
Location-Event Link
5.9. Event Section:
The event hub contains various fields that store real-world events.
Fig13: Event Link
Following are the event links that can be stored as separate links.
Event-Time Link
Event-Person Link
Event-Object Link
Event-Location Link
Event satellites are part of the vault; they contain information about events that
occur in the system.
Appendix A
Date:
Name of the Intern:
Reg No:
Branch:
Attributes (Points Awarded for each):
Attendance (Punctuality)
Productivity (Volume, Promptness)
Quality of Work (Accuracy, Completeness, Neatness)
Initiative (Self-Starter, Resourceful)
Attitude (Enthusiasm, Desire to Learn)
Interpersonal Relations (Cooperative, Courteous, Friendly)
Ability to Learn (Comprehension of New Concepts)
Communications Skills (Written and Oral Expression)
Judgement (Decision Making)
Areas where the student gained new skills, insights, values, confidence, etc.:
Overall Evaluation of the Intern's Performance:
Evaluation Scale (Points):
Appendix B
PO4 - Conduct investigations of complex problems: use research-based knowledge and
research methods, including design of experiments, analysis and interpretation of data,
and synthesis of information to provide valid conclusions.
Mapped activity: Investigation of various problems of farmers.
PO5 - Modern tool usage: create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools, including prediction and modelling, to complex
engineering activities with an understanding of the limitations.
Mapped activity: Used many of the tremendous tools for the development process.
PO10 - Communication: communicate effectively on complex engineering activities
with the engineering community and with society at large, such as being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
Mapped activity: Prepared and documented a summer internship report on the
Technology Entrepreneurship Program.
PO11 - Project management and finance: demonstrate knowledge and understanding of
the engineering and management principles and apply these to one's own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
PSO2 - ----