
ALTAIR DATA SCIENCE MASTER VIRTUAL INTERNSHIP

AN INTERNSHIP REPORT

Submitted by

P. YASO KRISHNA
Register No: 21K61A0648

In Partial Fulfillment of the Requirements for the Degree of

BACHELOR OF TECHNOLOGY

In

COMPUTER SCIENCE & TECHNOLOGY

NOVEMBER 2024

i
Certificate

ii
BONAFIDE CERTIFICATE

I hereby declare that the internship report titled "Altair Data Science Virtual Internship" is a record of
original work completed by me during my internship period at Sasi Institute of Technology &
Engineering from April to June 2024. The report is based on my experience and the insights gained
from completing the internship course.

I confirm that this report has not been submitted previously in any form to any other university,
institute, or publication for any purpose. I have also ensured that all information and
observations are accurately represented to the best of my knowledge, and any reference to data,
sources, or colleagues' work is duly acknowledged.

P. YASO KRISHNA
Name of the Student

Dr. K. Uma
Faculty In-charge
Associate Professor
Department of CST
SITE

iii
DECLARATION

I, P. YASO KRISHNA (21K61A0648), a student of Computer Science & Technology at Sasi
Institute of Technology & Engineering, Tadepalligudem, hereby declare that the Summer
Training Report entitled "Altair Data Science Virtual Internship" is an authentic record of my
own work, carried out as a requirement of Industrial Training from April 2024 to June 2024. I
obtained the knowledge of data science through the selfless efforts of the mentors arranged
for me by the administration. A training report was prepared on the same, and the suggestions
given by the faculty were duly incorporated.

P. YASO KRISHNA

21K61A0648

INCHARGE HEAD OF THE DEPARTMENT

INTERNAL EXAMINER EXTERNAL EXAMINER

iv
ABSTRACT

The Altair Data Science Virtual Internship provides an immersive, hands-on experience designed
to introduce participants to the core concepts, tools, and techniques used in data science.
Throughout the program, interns work on real-world datasets and gain practical experience in
data cleaning, exploration, analysis, and visualization. The internship covers key areas such as
statistical analysis, machine learning, data wrangling, and model evaluation using popular tools
and libraries such as Python, Pandas, NumPy, Matplotlib, and Scikit-learn. Participants also learn
how to communicate data-driven insights through reports and presentations, fostering a deeper
understanding of the end-to-end data science workflow. The program is structured to help
interns develop both technical skills and problem-solving abilities, preparing them for a
successful career in data science.

v
ACKNOWLEDGEMENT

First and foremost, I would like to thank the Lord Almighty for giving me the strength,
knowledge, ability, and opportunity to undertake this internship, to persevere, and to
complete it satisfactorily.
I would like to express my heartiest gratitude and thanks to my supervisor Dr. K. Uma,
Associate Professor, Department of Computer Science & Technology, Sasi Institute of
Technology & Engineering, Tadepalligudem, for her valuable guidance, continuous support, and
encouragement at all stages of this internship. Its successful and timely completion was
possible only through her inspiration and constructive comments.
I owe my gratitude to Dr. Mohamad Ismail, Principal, Sasi Institute of Technology & Engineering,
Tadepalligudem, for his sustained support. I express my heartfelt thanks to Dr. P. Kiran Kumar,
Professor & Head, Department of Computer Science & Technology, Sasi Institute of
Technology & Engineering, Tadepalligudem, for his kind support in pursuing this internship.

P. YASO KRISHNA

vi
Vision & Mission

Vision of the Institute


 To emerge as a premier institute for professional education by creating technocrats who
can address society's needs through inventions and innovations.

Mission of the Institute


 Partake in the national growth of technology and industry with societal responsibilities.
 Provide an environment that promotes productive research.
 Meet stakeholders' expectations through continual and sustained quality
improvements.

Vision of the Program


 To become a recognized Centre of Excellence for quality IT education and create
professionals with the ability to address social needs.

Mission of the Program


 To provide a quality teaching-learning environment that builds the necessary skills for
employability and career development.
 To conduct trainings and events for the overall development of stakeholders through
collaborations.
 To impart value education to students so they serve society with high integrity and good
character.
 To provide state-of-the-art facilities that enable innovation and student-centric learning.

vii
PEOs and PSOs

Program Educational Objectives


These PEOs are meant to prepare our students to thrive and to lead in their careers.
Our graduates will be able to:

P1: Graduates will have strong knowledge about IT applications, with leadership qualities.

P2: Graduates will pursue successful careers in IT and allied industries and provide solutions for global needs.

P3: Graduates will have a life-long learning attitude and practice professional ethics.

Program Outcomes
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex
engineering problems.
2. Problem analysis: Identify, formulate, research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety, and the cultural,
societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data,
and synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modelling to complex
engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.

viii
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member
or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary
environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.

Program Specific Outcomes


1. Application Development: Develop risk-free, innovative IT applications for industrial
needs.

2. Successful Career and Entrepreneurship: Explore technical knowledge in diverse
areas of IT and experience an environment conducive to cultivating skills for a
successful career, entrepreneurship, and higher studies.

ix
List of Contents

Topic                                                                    Pg. No

Chapter 1: Data Science Technology Stack


1.1. Rapid Information Factory (RIF) Ecosystem 1
1.2. Data Science Storage Tools 2
1.3. Data Lake 2
1.4. Data Vault 3
1.4.1. Hub 3
1.4.2. Link 3
1.4.3. Satellites 3
1.5. Data Science Processing Tools 4
1.5.1. Spark 4
1.5.2. Spark Core 5
1.5.3. Spark SQL 6
1.5.4. Spark Streaming 6
1.5.5. Graphx 7
1.6. Different Programming Languages in Data Science Processing 7
1.6.1. Elastic Search 8
1.6.2. R Language 8
1.6.3. Scala 8
1.6.4. Python 9

Chapter 2: Three Management Layers


2.1. Introduction 10
2.2. Operational Management Layer 10
2.3. Audit, Balance, and Control Layer 10
2.3.1. Audit 10
2.3.2. Balance 11
2.3.3. Control 11

2.4. Yoke Solution 11


2.4.1. Producer 11

x
2.4.2. Consumer 12
2.4.3. Directed Acyclic Graph Scheduling 12
2.5. Functional Layer 12
2.6. Data Science Process 13

Chapter 3: Retrieve Super Step


3.1. Introduction 14
3.2. Data Lakes 14
3.3. Data Swamps 14
3.3.1. Start with Concrete Business Questions 15
3.3.2. Data Quality 15
3.3.3. Audit and Version Management 15
3.3.4. Data Governance 15
3.4. Training the Trainer Model 15
3.5. Shipping Technologies 16
3.5.1. Shipping Terms 16
3.5.2. Incoterm 2010 16
3.6. Other Data Sources / Stores 17

Chapter 4: Assess Super Step


4.1. Introduction 18
4.2. Errors 18
4.2.1. Accept the Error 18
4.2.2. Reject the Error 18
4.2.3. Create a Default Value 19
4.3. Analysis of Data 19
4.3.1. Completeness 20
4.3.2. Consistency 20
4.3.3. Conformity 20
4.3.4. Accuracy 20

Chapter 5: Process Super Step


5.1. Objectives 21
5.2. Introduction 21
5.3. Data Vault 21
5.3.1. Hubs 21
x
5.3.2. Links 21
5.3.3. Satellites 22
5.3.4. Reference Satellites 22
5.4. Time-Person-Object-Location-Event Data Vault 22
5.5. Time Section 22
5.5.1. Time Hub 22
5.5.2. Time Links 23
5.5.3. Time Satellites 23
5.6. Person Section 23
5.6.1. Person Hub 24
5.6.2. Person Links 24
5.6.3. Person Satellites 24
5.7. Object Section
5.7.1. Object Hub 25
5.7.2. Object Links 25
5.7.3. Object Satellites 26
5.8. Location Section 26
5.8.1. Location Hub 26
5.8.2. Location Links 26
5.8.3. Location Satellites 27
5.9. Event Section 27
5.9.1. Event Hub 27
5.9.2. Event Links 28
5.9.3. Event Satellites 28

x
List of Figures
Pg. No

1. Representation of data lake 3

2. Data Vault 4

3. Representation of Apache Spark Ecosystem 5

4. Apache Spark Core Representation 6

5. Flowchart of Spark Streaming 7

6. Analysis of Data 19

7. Categories of Data 21

8. Time-Person-Object-Location-Event high-level design 22

9. Time Link 23

10. Person Link 24

11. Object Link 25

12. Location Link 27

13. Event Link 28

xiii
CHAPTER 1
Data Science Technology Stack
1.1. Rapid Information Factory (RIF) Ecosystem:
The Rapid Information Factory (RIF) system is a technique and tool set used for
processing data during development. The Rapid Information Factory is a massively parallel
data processing platform capable of processing data sets of theoretically unlimited size.

The Rapid Information Factory (RIF) platform supports five high-level layers:

 Functional Layer:
The functional layer is the core processing capability of the factory.
Core functional data processing methodology is the R-A-P-T-O-R framework.

1. Retrieve Super Step:


The retrieve super step supports the interaction between external data sources
and the factory.

2. Assess Super Step:


The assess super step supports the data quality clean-up in the factory.

3. Process Super Step:


The process super step converts data into data vault.

4. Transform Super Step:


The transform super step converts data vault via sun modeling into
dimensional modeling to form a data warehouse.

5. Organize Super Step:


The organize super step sub-divides the data warehouse into data marts.

6. Report Super Step:


The report super step is the Virtualization capacity of the factory.

Common components supporting other layers.

1. Maintenance Utilities

2. Data Utilities

3. Processing Utilities

 Business Layer:
The business layer contains the business requirements (functional and non-functional).
1.2. Data Science Storage Tools:


 The data science ecosystem has a series of tools that are used to build your solution. By
using these tools and techniques you obtain information rapidly, and better capabilities
and new developments appear each day.

 There are two basic data processing ecosystems used in practical data science, as
given below:

 Schema on write ecosystem

1. A traditional Relational Database Management System requires a schema before
loading the data. The schema basically describes the organization of the data and acts like a
blueprint for how the database should be constructed.
2. A schema is a single structure that represents the logical view of the entire database. It
represents how the data is organized and how the items relate to one another.
3. It is the responsibility of the database designer, with the help of the programmer, to
design the database so that its logic and structure are well understood.

 Schema on read ecosystem

1. A schema-on-read ecosystem does not need a schema up front; you can load the
data into the store without one.
2. It can store structured, semi-structured, and unstructured data, and it applies the
schema flexibly when the data is queried during execution.
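As a minimal illustration of the two ecosystems, the sketch below contrasts a schema-on-write table (SQLite, where the schema must exist before any insert) with a schema-on-read load (pandas reading semi-structured JSON whose fields are interpreted only at query time); the table name, columns, and sample records are assumptions made for the example.

    import io
    import sqlite3
    import pandas as pd

    # Schema on write: the table structure is declared before any data is loaded.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Tadepalligudem')")
    print(conn.execute("SELECT name, city FROM customer").fetchall())

    # Schema on read: semi-structured records are loaded as-is and interpreted at query time.
    raw = io.StringIO('[{"name": "Asha", "city": "Tadepalligudem"}, {"name": "Ravi", "age": 21}]')
    df = pd.read_json(raw)
    print(df[["name", "city"]])  # fields missing from a record simply appear as NaN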

1.3. Data Lake:


 A data lake is a storage repository for a large amount of raw data, that is, structured,
semi-structured, and unstructured data.

 It is the place where you can store these three types of data with no fixed limit on the
amount of data or storage.

 A data lake stores little data in structured databases because it follows a schema-on-read
architecture.

 A data lake allows us to transform the raw data (structured, semi-structured, and
unstructured) into a structured format so that SQL queries can be performed for analysis.

 A data lake is similar to a real river or lake, where the water comes from many different
places; all the small rivers and streams merge into one big lake where a large amount of
water is stored, and whenever water is needed, it can be used by anyone.
Fig1: Representation of data lake

1.4. Data Vault:


 Data vault is a database modeling method designed to store long-term historical data,
and that data can be controlled by using the data vault.

 A data lake and a data vault are built using three main components or structures of
data, i.e. the hub, the link, and the satellite.

1.4.1. Hub:
A hub holds a unique business key, with a low propensity to change, together with
metadata describing the source from which the hub entry was generated.

1.4.2. Link:
 Links associate the business keys of two or more hubs; they represent and connect
the elements of a business relationship.

 When one hub relates to another, the link records that relationship so that data
transfers smoothly between them.

1.4.3. Satellites:
 Hubs and links form the structure of the model, but they store no chronological or
descriptive attributes; on their own they cannot provide information such as the
mean, median, mode, maximum, minimum, or sum of the data.

 Satellites are the descriptive structures of the data vault; they store detailed
information about the related data or business characteristic keys and hold the
large volume of the data vault.

 The combination of all three structures, hub, link, and satellite, helps data analysts,
data scientists, and data engineers store the business structure and its types of
information or data.
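The following is a minimal pandas sketch, an illustrative assumption rather than the report's own implementation, of the three data vault structures: a hub of business keys, a link joining two hubs, and a satellite carrying descriptive attributes over time; the customer/product example and column names are invented for demonstration.

    import pandas as pd

    # Hub: unique business keys plus metadata about where the key was first seen.
    hub_customer = pd.DataFrame({"customer_key": ["C001", "C002"], "record_source": ["crm", "crm"]})
    hub_product = pd.DataFrame({"product_key": ["P010", "P020"], "record_source": ["erp", "erp"]})

    # Link: relationship between business keys from the two hubs.
    link_purchase = pd.DataFrame({"customer_key": ["C001", "C002"], "product_key": ["P020", "P010"]})

    # Satellite: chronological, descriptive attributes attached to a hub key.
    sat_customer = pd.DataFrame({
        "customer_key": ["C001", "C001"],
        "load_date": pd.to_datetime(["2024-04-01", "2024-05-01"]),
        "city": ["Tadepalligudem", "Hyderabad"],
    })

    # Joining hub, link, and satellite reconstructs the business view.
    view = link_purchase.merge(sat_customer, on="customer_key").merge(hub_product, on="product_key")
    print(view)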

Fig2: Data Vault

1.5. Data Science Processing Tools:


 Data science processing is the process of transforming data from the data lake into a
data vault and then transforming the data vault into a data warehouse.

 Most data scientists, data analysts, and data engineers use these data science
processing tools to process data and transfer the data vault into the data warehouse.

1.5.1. Spark:

1. Apache Spark is an open-source cluster computing framework. Open source means it
is freely available on the internet; you can download the source code and use it as you
wish.

2. Apache Spark has the capability to process all types and varieties of data from
repositories including the Hadoop Distributed File System (HDFS) and NoSQL
databases.

3. Apache Spark provides an interface for programmers and developers to interact with
the system directly and to make data processing parallel and convenient for data
scientists and data engineers.

Fig3: Representation of Apache Spark Ecosystem

1.5.2. Spark Core:


1. Spark Core is the base and foundation of the overall project; it provides the most
important facilities such as distributed task dispatching, scheduling, and basic input
and output functionality.

2. By using Spark Core, you can run more complex queries that help you work in
complex environments.

3. The distributed nature of the Spark ecosystem enables you to take the same
processing that runs on a small cluster and scale it to hundreds or thousands of nodes
without making any changes.

4. Apache Spark comes with advanced analytics: it supports not only map and reduce
operations but also SQL queries, machine learning, and graph algorithms.
Fig4: Apache Spark Core Representation
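As a hedged illustration of Spark Core's distributed map and reduce operations, the short PySpark sketch below (assuming the pyspark package is installed and a local Spark runtime is available) parallelizes a small list and aggregates it; the numbers and the application name are arbitrary.

    from pyspark.sql import SparkSession

    # Start a local Spark session; Spark Core's SparkContext handles task dispatching and scheduling.
    spark = SparkSession.builder.appName("spark-core-sketch").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    # Distribute a small data set, apply a map, and reduce it back to a single value.
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    squared_sum = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
    print(squared_sum)  # 55

    spark.stop()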

1.5.3. Spark SQL:


1. Spark SQL is a component on top of Spark Core that presents a data abstraction
called DataFrames.

2. Spark SQL is a fast, clustered data abstraction, so data manipulation can be done
with fast computation.

3. It enables the user to run SQL/HQL queries on top of Spark, and by using this, we
can process structured, semi-structured, and unstructured data.

4. Apache Spark SQL provides a bridge between relational databases and procedural
processing. This matters when we want to load data from traditional systems into
the data lake ecosystem.
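The sketch below is a minimal, assumed example of the Spark SQL abstraction described above: it builds a DataFrame, registers it as a temporary view, and runs a SQL query over it. The table name, columns, and rows are invented, and it assumes a local PySpark installation as in the previous sketch.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-sketch").master("local[*]").getOrCreate()

    # Create a DataFrame (the Spark SQL data abstraction) from a small in-memory data set.
    sales = spark.createDataFrame(
        [("Mumbai", 120), ("Delhi", 95), ("Mumbai", 60)],
        ["city", "amount"],
    )

    # Register the DataFrame as a temporary view so it can be queried with SQL.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT city, SUM(amount) AS total FROM sales GROUP BY city").show()

    spark.stop()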

1.5.4. Spark Streaming:


1. Apache Spark Streaming enables powerful interactive and analytic applications over
live streaming data. In streaming, the data is not fixed and arrives continuously from
different sources.

2. The stream divides the incoming input data into small units of data for further
analytics and processing at the next level.

3. There are multiple levels of processing involved. Live streaming data is received and
divided into small parts or batches, and these batches are then processed by the Spark
engine to produce the final stream of results.

4. Data processing in Hadoop has very high latency, meaning the data is not available in
a timely manner, so it is not suitable for real-time processing requirements.

5. Stream processors such as Apache Storm reprocess data if processing did not happen,
but this kind of failure handling and latency can cause data loss or repeated
processing of records.

Fig5: Flowchart of Spark Streaming
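As a small, assumed illustration of the batching idea in the figure above, the following PySpark Structured Streaming sketch counts words arriving on a local socket; it assumes pyspark is installed and a text source is listening on localhost:9999 (for example, one started with nc -lk 9999).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-sketch").master("local[*]").getOrCreate()

    # Read an unbounded stream of text lines from a socket source.
    lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

    # Split each line into words and maintain a running count per word.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Each micro-batch of results is printed to the console until the query is stopped.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()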

1.5.5. GraphX:
1. GraphX is a very powerful graph processing application programming interface for
the Apache Spark analytics engine.

2. GraphX is a component in Spark for graphs and graph-parallel computation.

3. GraphX follows the ETL process (Extract, Transform, and Load) and supports
exploratory analysis and iterative graph computation within a single system.

4. Its usage can be seen in Facebook friend networks, LinkedIn connections, Google
Maps, and internet routers, which use these kinds of tools for better response and
analysis.

5. A graph is an abstract data type; it is used to implement the directed and undirected
graph concepts from graph theory in mathematics.

1.6. Different Programming Languages in Data Science Processing:

1.6.1. Elastic Search:


 Elasticsearch is an open-source, distributed search engine.

 Scalability means that it can scale from any point of view; reliability means that it
should be trustworthy and offer stress-free management.

 It is used as a replacement for document and data stores such as MongoDB.

 Elasticsearch is one of the most popular search engines and is used by
organizations such as Google, Stack Overflow, GitHub, and many more.

1.6.2. R Language:

 R is a programming language used for statistical computing and graphics.

 The R language is used by data engineers, data scientists, statisticians, and data
miners for developing software and performing data analytics.

 Popular related R packages include sqldf, forecast, dplyr, stringr, lubridate,
ggplot2, reshape, and others.

 R is freely available under the GNU General Public License, and it supports many
platforms such as Windows, Linux/Unix, and Mac.

 R has built-in capabilities to be integrated with procedural code written in C, C++,
Java, .NET, and Python.

1.6.3. Scala:
 Scala is a general-purpose programming language that supports functional
programming and a strong static type system.

 Many data science projects and frameworks are built using the Scala programming
language because it has many capabilities and much potential.

 Scala integrates the features of object-oriented and functional languages; it runs on
the Java Virtual Machine and interoperates with Java.

1.6.4. Python:
 Python is a programming language, and it can be used on a server to create web
applications.

 Python can be used for web development, mathematics, and software development,
and it is used to connect to databases and to create and modify data.

 Python can handle large amounts of data and is capable of performing complex
tasks on data.

 Python is reliable, portable, and flexible, working across different platforms such
as Windows, Mac, and Linux.

 Python supports object-oriented and functional programming styles and works
well with structured data.
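To connect the languages above with the libraries mentioned in the abstract (Pandas, NumPy, Scikit-learn), here is a small assumed Python sketch of a typical exploration step: building a data set, summarizing it, and fitting a simple model. The values and column names are invented for illustration, and pandas, numpy, and scikit-learn are assumed to be installed.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Build a small illustrative data set.
    df = pd.DataFrame({
        "hours_studied": [2, 4, 6, 8, 10],
        "score": [35, 50, 62, 78, 91],
    })

    # Explore: quick statistical summary of the data.
    print(df.describe())

    # Model: fit a simple linear regression of score on hours studied.
    model = LinearRegression().fit(df[["hours_studied"]], df["score"])
    print("slope:", model.coef_[0], "intercept:", model.intercept_)
    print("predicted score for 7 hours:", model.predict(np.array([[7]]))[0])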

9
CHAPTER 2
Three Management Layers

2.1. Introduction:
The Three Management Layers are a very important part of the framework. They
watch the overall operations in the data science ecosystem and make sure that things are
happening as per plan.

2.2. Operational Management Layer:


 This layer is the center of complete processing capability in the data science
ecosystem.

 This layer stores what you want to process, along with every processing schedule and
workflow for the entire ecosystem.

 We record the following in the operations management layer:

1. Definition and Management of Data Processing stream

2. Ecosystem Parameters

3. Overall Process Scheduling

4. Overall Process Monitoring

5. Overall Communication

6. Overall Alerting

2.3. Audit, Balance, and Control Layer:

10
2.3.1. Audit:
 An audit refers to an examination of the ecosystem that is systematic
and independent.

 This sublayer records which processes are running at any given point within the
ecosystem.

 Data scientists and engineers use the information collected here to better
understand and plan future improvements to the processing.

 The audit in the data science ecosystem consists of a series of observers that
record prespecified processing indicators related to the ecosystem.

2.3.2. Balance:

 The balance sublayer has the responsibility of making sure that the data science
environment is balanced between the available processing capability and the
required processing capability, or that it can upgrade processing capability during
periods of extreme processing demand.

 In such cases, the on-demand processing capability of a cloud environment becomes
highly desirable.

2.3.3. Control:

 The execution of the currently active data science processes is controlled by the
control sublayer.

 When a processing pipeline encounters an error, the control sublayer attempts a
recovery as per our prespecified requirements; if recovery does not work, it
schedules a cleanup utility to undo the error.

 The cause-and-effect analysis system is the core data source for the
distributed control system in the ecosystem.

2.4. Yoke Solution:

 The yoke solution is a custom design.

 Apache Kafka is an open-source stream processing platform. Its function is to
deliver a unified, high-throughput, low-latency platform for handling real-time
data feeds.

 Kafka provides a publish-subscribe solution that can handle all activity-stream
data and processing. The Kafka environment enables you to send messages
between producers and consumers.

2.4.1. Producer:

 The producer is the part of the system that generates the requests for data
science processing by creating structured messages for each type of data
science process it requires.

 The producer is the end point of the pipeline that loads messages into Kafka.

2.4.2. Consumer:

 The consumer is the part of the process that takes in messages and organizes
them for processing by the data science tools.

 The consumer is the end point of the pipeline that offloads the messages from
Kafka.
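To illustrate the producer and consumer end points described above, here is a minimal sketch using the third-party kafka-python package; it assumes a Kafka broker is reachable at localhost:9092 and uses an invented topic name, so treat it as an assumption-laden example rather than the actual yoke implementation.

    from kafka import KafkaProducer, KafkaConsumer

    TOPIC = "data-science-requests"  # hypothetical topic name

    # Producer: loads structured messages (here, simple bytes) into Kafka.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send(TOPIC, b'{"process": "retrieve", "source": "sales.csv"}')
    producer.flush()

    # Consumer: offloads messages from Kafka and hands them to the data science tools.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
    )
    for message in consumer:
        print(message.value.decode("utf-8"))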

2.4.3. Directed Acyclic Graph Scheduling:

 This solution uses a combination of graph theory and publish-subscribe stream
data processing to enable scheduling.

 You can use the Python NetworkX library to resolve any conflicts by simply
formulating the graph to a specific point before or after you send or receive
messages via Kafka.
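Below is a small sketch of directed acyclic graph scheduling with the NetworkX library mentioned above; the step names and dependencies are assumptions used only to show how a topological sort yields a valid execution order.

    import networkx as nx

    # Build a DAG of processing steps: an edge A -> B means A must run before B.
    dag = nx.DiGraph()
    dag.add_edges_from([
        ("retrieve", "assess"),
        ("assess", "process"),
        ("process", "transform"),
        ("transform", "organize"),
        ("organize", "report"),
    ])

    # A topological sort gives an execution order that respects every dependency.
    assert nx.is_directed_acyclic_graph(dag)
    print(list(nx.topological_sort(dag)))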

2.4.4. Cause-and-Effect Analysis System:

 The cause-and-effect analysis system is the part of the ecosystem that collects
all the logs, schedules, and other ecosystem-related information and enables
data scientists to evaluate the quality of their system.

2.5. Functional Layer:

12
 The functional layer of the data science ecosystem is the largest and most
essential layer for programming and modeling.

2.6. Data Science Process:


 Begin the process by asking a "What if?" question - Decide what you want to
know, even if it is only the subset of the data lake you want to use for your
data science; that is a good start.

 Create a hypothesis by putting together observations - Use your
experience or insights to guess a pattern you want to discover, to uncover
additional insights from the data you already have.

 Gather observations and use them to produce a hypothesis - A
hypothesis is a proposed explanation, prepared on the basis of limited
evidence, as a starting point for further investigation.

 Verify the hypothesis using real-world evidence - Now we verify our
hypothesis by comparing it with real-world evidence.

13
CHAPTER 3
Retrieve Super Step
3.1. Introduction:

 The Retrieve super step is the first contact between your data science and the source
systems.

 The successful retrieval of the data is a major stepping-stone to ensuring that you are
performing good data science.

 Data lineage delivers the audit trail of the data elements at the lowest granular level,
to ensure full data governance.

 Data governance supports metadata management for system guidelines, processing
strategies, policy formulation, and the implementation of processing.

3.2. Data Lakes:


 A company's data lake covers all data that your business is authorized to process, to
attain improved profitability of your business's core accomplishments.

 The data lake is the complete data world your company interacts with during its
business life span.

 In simple terms, if you generate data or consume data to perform your business tasks,
that data is in your company's data lake.

 Just as a lake needs rivers and streams to feed it, the data lake will consume an
unavoidable deluge of data sources from upstream and deliver it to downstream
partners.

3.3. Data Swamps:


 Data swamps are simply data lakes that are not managed.

 They are not to be feared. They need to be tamed.

 Following are four critical steps to avoid a data swamp.


1. Start with Concrete Business Questions
2. Data Quality
3. Audit and Version Management
4. Data Governance
3.3.1. Start with Concrete Business Questions:

 Simply dumping a horde of data into a data lake, with no tangible purpose in
mind, will result in a big business risk.

 The data lake must be enabled to collect the data required to answer your
business questions.

 It is suggested to perform a comprehensive analysis of the entire set of data
you have and then apply a metadata classification to the data, stating full data
lineage, before allowing it into the data lake.

3.3.2. Data Quality:


 More data points do not mean that data quality is less relevant.

 Data quality can cause the invalidation of a complete data set, if not dealt with
correctly.

3.3.3. Audit and Version Management:


 You must always report the following:
1. Who used the process?
2. When was it used?
3. Which version of code was used?
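A lightweight way to capture these three facts is an audit log entry written each time a process runs; the sketch below is an assumed example using only the Python standard library, with an invented CODE_VERSION constant standing in for whatever version tag your pipeline actually uses.

    import getpass
    import json
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    CODE_VERSION = "v1.4.2"  # hypothetical version tag for the running code

    def audit(process_name):
        """Record who used the process, when it was used, and which code version ran."""
        entry = {
            "process": process_name,
            "user": getpass.getuser(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "code_version": CODE_VERSION,
        }
        logging.info("AUDIT %s", json.dumps(entry))
        return entry

    audit("retrieve_sales_data")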

3.3.4. Data Governance:


 The role of data governance, data access, and data security does not go away
as the volume of data in the data lake grows.

 It simply compounds into a worse problem if not managed.

3.4. Training the Trainer Model:


 To prevent a data swamp, it is essential that you also train your team. Data science is a
team effort.

 People, process, and technology are the three cornerstones that ensure data is
curated and protected.

 You are responsible for your people; share the knowledge you acquire. The process
you learn, you need to teach them. Alone, you cannot achieve success.

3.5. Shipping Technologies:


3.5.1. Shipping Terms:
 These determine the rules of the shipment, the conditions under which it is
made. Normally, these are stated on the shipping manifest.

 Following are the terms used:

1. Seller - The person/company sending the products on the shipping


manifest is the seller. This is not a location but a legal entity sending the
products.

2. Carrier - The person/company that physically carries the products on the


shipping manifest is the carrier. Note that this is not a location but a legal
entity transporting the products.

3. Port - A port is any point from which you have to exit or enter a country.
Normally, these are shipping ports or airports, but they can also include border
crossings by road. Note that there are two ports in the complete process.
This is important. There is a port of exit and a port of entry.

4. Ship - Ship is the general term for the physical transport method used
for the goods. This can refer to a cargo ship, airplane, truck, or even
person, but it must be identified by a unique allocation number.

3.5.2. Incoterm 2010:


 Incoterm 2010 is a summary of the basic options, as determined and published
by a standard board.
16
 This option specifies which party has an obligation to pay if something
happens to the product being shipped.

3.6. Other Data Sources / Stores:


 While performing data retrieval you may have to work with one of the following data
stores.

 SQLite - This requires a package named sqlite3.

 Microsoft SQL Server - Microsoft SQL server is common in companies, and this
connector supports your connection to the database.

 MySQL - MySQL is widely used by lots of companies for storing data. This opens
that data to your data science with the change of a simple connection string.

 Apache Cassandra - Cassandra is becoming a widely used distributed database engine in
the corporate world.

 Pydoop - It is a Python interface to Hadoop that allows you to write MapReduce


applications and interact with HDFS in pure Python.

 Microsoft Excel - Excel is common in the data sharing ecosystem, and it enables you
to load files using this format with ease.
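As a hedged example of retrieving from one of these stores, the sketch below creates a small SQLite database with the standard sqlite3 package and reads it straight into pandas; the table and rows are invented for the example, and a connection to MySQL or SQL Server would differ only in the connection object.

    import sqlite3
    import pandas as pd

    # Create an in-memory SQLite database and load a small sample table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, city TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "Mumbai", 120.0), (2, "Delhi", 95.0), (3, "Mumbai", 60.0)],
    )
    conn.commit()

    # Retrieve the data into a pandas DataFrame for the rest of the data science steps.
    df = pd.read_sql_query("SELECT city, SUM(amount) AS total FROM orders GROUP BY city", conn)
    print(df)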

17
CHAPTER 4
Assess Super Step

4.1. Introduction:
Data quality problems result in a 20% decrease in worker productivity and explain
why 40% of business initiatives fail to achieve set goals. Incorrect data can harm a reputation,
misdirect resources, slow down the retrieval of information, and lead to false insights and
missed opportunities.
For example, if an organization has the incorrect name or mailing address of a
prospective client, their marketing materials could go to the wrong recipient. If sales data is
attributed to the wrong SKU or brand, the company might invest in a product line with less
than stellar customer demand.
Data profiling is the process of examining, analyzing and reviewing data to collect
statistics surrounding the quality and hygiene of the dataset. Data quality refers to the
accuracy, consistency, validity and completeness of data. Data profiling may also be known
as data archeology, data assessment, data discovery or data quality analysis.

4.2. Errors:
Errors are the norm, not the exception, when working with data. By now, you’ve
probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot safely
assume that any of the data we work with is error-free, our mission should be to find and
tackle errors in the most efficient way possible.

4.2.1. Accept the Error:


If an error falls within an acceptable standard (e.g., "Navi Mum." instead of
"Navi Mumbai"), then it could be accepted, and you can move on to the next data entry. But
remember that if you accept the error, you will affect data science techniques and
algorithms that perform classification, such as binning, regression, clustering, and
decision trees, because these processes assume that the values in this example are not
the same. This option is the easy option, but not always the best option.

4.2.2. Reject the Error:


Unless the nature of the missing data is "missing completely at random", deletion is a
method best avoided in many cases. The table below shows a sample with missing (NA)
values that rejection would discard.

User   Device   OS        Transactions
A      Mobile   Android   5
B      Mobile   Windows   4
C      Tablet   NA        3
D      NA       Android   2
E      Mobile   IOS       1

4.2.3. Create a Default Value:


NaN is the default missing value marker for reasons of computational speed
and convenience. It is a sentinel value, in the sense that it is a dummy or flag
value that can be easily detected and worked with using functions in pandas.
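The following pandas sketch ties together the three error strategies above: accepting a value as-is, rejecting rows by deletion, and creating a default value for the NaN sentinel. The device/OS table mirrors the sample shown in the previous subsection and is otherwise an invented example.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "user": ["A", "B", "C", "D", "E"],
        "device": ["Mobile", "Mobile", "Tablet", np.nan, "Mobile"],
        "os": ["Android", "Windows", np.nan, "Android", "IOS"],
        "transactions": [5, 4, 3, 2, 1],
    })

    # Accept the error: keep the data exactly as it arrived.
    accepted = df.copy()

    # Reject the error: drop any row containing a missing value.
    rejected = df.dropna()

    # Create a default value: replace the NaN sentinel with an explicit placeholder.
    defaulted = df.fillna({"device": "Unknown", "os": "Unknown"})

    print(rejected)
    print(defaulted)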

4.3. Analysis of Data:


One of the causes of data quality issues is in source data that is housed in a patchwork
of operational systems and enterprise applications. Each of these data sources can have
scattered or misplaced values, outdated and duplicate records, and inconsistent (or undefined)
data standards and formats across customers, products, transactions, financials and more.
Data quality problems can also arise when an enterprise consolidates data during a
merger or acquisition. But perhaps the largest contributor to data quality issues is that the data
are being entered, edited, maintained, manipulated, and reported on by people. To maintain
the accuracy and value of the business-critical operational information that impacts strategic
decision-making, businesses should implement a data quality strategy that embeds data
quality techniques into their business processes, their enterprise applications, and their data
integration.

Fig6: Analysis of Data

4.3.1. Completeness:
Completeness is defined as expected comprehensiveness. Data can be
complete even if optional data is missing. As long as the data meets the expectations
then the data is considered complete. For example, a customer’s first name and last
name are mandatory but middle name is optional; so a record can be considered
complete even if a middle name is not available.

4.3.2. Consistency:
Consistency means data across all systems reflects the same information and
is in sync across the enterprise.

4.3.3. Conformity:
Conformity means the data is following the set of standard data definitions
like data type, size and format. For example, date of birth of customer is in the format
“mm/dd/yyyy”.

20
4.3.4. Accuracy:
Accuracy is the degree to which data correctly reflects the real-world object or
event being described. Examples: the sales figure of the business unit is the real value;
the address of an employee in the employee database is the real address.
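To make the four dimensions above measurable, here is a small assumed pandas sketch computing simple completeness, conformity, and accuracy indicators on an invented customer table; real projects would use their own rules and reference data.

    import pandas as pd

    customers = pd.DataFrame({
        "first_name": ["Asha", "Ravi", None],
        "dob": ["04/12/1999", "1999-31-01", "07/05/2001"],   # expected format mm/dd/yyyy
        "city": ["Tadepalligudem", "Delhi", "Delhi"],
    })
    reference = pd.DataFrame({"city": ["Tadepalligudem", "Delhi", "Delhi"]})  # trusted source

    # Completeness: share of mandatory first names that are present.
    completeness = customers["first_name"].notna().mean()

    # Conformity: share of dates matching the mm/dd/yyyy standard.
    conformity = customers["dob"].str.match(r"^\d{2}/\d{2}/\d{4}$").mean()

    # Consistency / accuracy: share of city values that agree with the trusted reference system.
    accuracy = (customers["city"] == reference["city"]).mean()

    print(f"completeness={completeness:.2f} conformity={conformity:.2f} accuracy={accuracy:.2f}")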

CHAPTER 5
Process Super Step

5.1. Objectives:
The objective of this chapter is to learn the Time-Person-Object-Location-Event
(T-P-O-L-E) design principle and the various concepts that are used to create and define
relationships among these data.

5.2. Introduction:
The Process superstep uses the assessed results of the retrieved versions of the data
sources to build a highly structured data vault. These data vaults form the basic data structure
for the rest of the data science steps.

21
Fig7: Categories of Data

5.3. Data Vault:


Data vault modelling is a technique for managing the long-term storage of data from
multiple operational systems. It stores historical data in the database.

5.3.1. Hubs:
A data vault hub is used to store business keys. These keys do not change over
time. A hub also contains a surrogate key for each hub entry and metadata information
for the business key.
5.3.2. Links:
Data vault links are join relationships between business keys.

5.3.3. Satellites:
Data vault satellites store the chronological, descriptive characteristics for
a specific section of business data. Using hubs and links we get the model structure but no
chronological characteristics. Satellites consist of characteristics and metadata linking
them to their specific hub.
5.3.4. Reference Satellites:
Reference satellites are referenced from satellites and can be used by other
satellites to prevent redundant storage of reference characteristics.

5.4. Time-Person-Object-Location-Event Data Vault:


We will use the Time-Person-Object-Location-Event (T-P-O-L-E) design principle. All
five sections are linked with each other, resulting in sixteen links.

22
Fig8: Time-Person-Object-Location-Event high-level design

5.5. Time Section:


The time section contains the data structures to store all time-related information, for
example, the time at which an event occurred.
5.5.1. Time Hub:
This hub acts as a connector between time zones. Following are the fields of the
time hub.

5.5.2. Time Links:

Time links connect the time hub to the other hubs.

23
Fig9: Time Link
Following are the time links that can be stored as separate links.
 Time-Person Link

 Time-Object Link

 Time-Location Link

 Time-Event Link

5.5.3. Time Satellites:

Following are the fields of the time satellites.

A time satellite can be used to move from one time zone to another very easily.
This feature will be used during the Transform superstep.
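As a small illustration of the time-zone movement mentioned above, the sketch below uses Python's standard zoneinfo module (Python 3.9+) to convert an event timestamp between zones; the timestamp and zone names are arbitrary examples.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # An event recorded in Indian Standard Time.
    event_ist = datetime(2024, 4, 15, 9, 30, tzinfo=ZoneInfo("Asia/Kolkata"))

    # The time satellite idea: the same instant expressed in other zones.
    print(event_ist.astimezone(ZoneInfo("UTC")))
    print(event_ist.astimezone(ZoneInfo("America/New_York")))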

5.6. Person Section:

The person section contains the data structures to store all data related to persons.
5.6.1. Person Hub:
Following are the fields of the person hub.

5.6.2. Person Links:

Person links connect the person hub to the other hubs.

Fig10: Person Link

Following are the person links that can be stored as separate links.

 Person-Time Link

 Person-Object Link

 Person-Location Link

 Person-Event Link

5.6.3. Person Satellites:


Person satellites are part of the vault. Basically, they hold information such as the
birthdate, anniversary, or ID validity dates for the respective person.

5.7. Object Section:

The object section contains the data structures to store all data related to objects.

5.7.1. Object Hub:

25
The object hub represents a real-world object with a few attributes. Following
are the fields of the object hub.

5.7.2. Object Links:

Object links connect the object hub to the other hubs.

Fig11: Object Link

Following are the object links that can be stored as separate links.

 Object-Time Link

 Object-Person Link

 Object-Location Link

 Object-Event Link

5.7.3. Object Satellites:


Object satellites are part of the vault. Basically, they hold information such as the ID, UUID,
type, key, etc. for the respective object.

26
5.8. Location Section:

The location section contains the data structures to store all data related to locations.

5.8.1. Location Hub:

The location hub consists of a series of fields that support a GPS location.
The location hub consists of the following fields:

5.8.2. Location Links:

Location links connect the location hub to the other hubs.

Fig12: Location Link

Following are the location links that can be stored as separate links.

 Location-Time Link
27
 Location-Person Link

 Location-Object Link

 Location-Event Link

5.8.3. Location Satellites:

Location satellites are the part of the vault that contains the locations of entities.

5.9. Event Section:

The event section contains the data structures to store all data about entities related to
events that have occurred.

5.9.1. Event Hub:

The event hub contains various fields that store real-world events.

5.9.2. Event Links:

Event links connect the event hub to the other hubs.

Fig13: Event Link

Following are the event links that can be stored as separate links.

 Event-Time Link

 Event-Person Link

 Event-Object Link

 Event-Location Link

5.9.3. Event Satellites:

Event satellites are part of the vault; they contain information about the events that
occur in the system.

29
Appendix A

INDUSTRIAL INTERNSHIP EVALUATION FORM


For the Students of B.Tech. (IT), Sasi Institute of
Technology & Engineering, Tadepalligudem, West Godavari
District, Andhra Pradesh

Date:
Name of the Intern :

Reg No :

Branch :

Internship offered : APRIL-JUNE 24

Evaluate this student intern on the following parameters by checking the appropriate attributes.

Give your feedback with a tick mark (√).

Evaluation Parameters          Excellent   Very Good   Good   Satisfactory   Poor

Attendance

(Punctuality)

Productivity

(Volume, Promptness)

Quality of Work

(Accuracy,
Completeness,
Neatness)
Initiative

(Self-Starter, Resourceful)

30
Attitude
(Enthusiasm, Desire to
Learn)

Interpersonal Relations
(Cooperative, Courteous,
Friendly)

Ability to Learn
(Comprehension of New
Concepts)

Use of Academic Training


(Applies Education to
Practical Usage)

Communications Skills
(Written and Oral
Expression)

Judgement
(Decision Making)

Please summarize. Your comments are especially helpful.

Areas where student excels:

Areas where student needs to improve:

Areas where student gained new skills, insights, values, confidence, etc.

Was student’s academic preparation sufficient for this internship?


31
Additional comments or suggestions for the student:

Overall Evaluation of the Intern's Performance (Evaluation Scale shown below)      Points Awarded: _____

Evaluation Scale:

Attributes   Excellent   Very Good   Good   Satisfactory   Poor
Points

Signature of the Guide/Supervisor :


Name of the Guide/Supervisor :
Designation :

32
Appendix B

POs and PSOs Relevance with Internship Work

Program Outcomes | Relevance to the Internship Work


PO1 - Engineering Knowledge: Apply knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
Relevance: Applied basic knowledge of engineering to understand about entrepreneurship.

PO2 - Problem Analysis: Identify, formulate, research literature, and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
Relevance: Performed research in various ways to analyze problems and find a solution.

PO3 - Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.
Relevance: Able to understand the market strategies and problems in the society.

PO4 - Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of information, to provide valid conclusions.
Relevance: Investigation of various problems of farmers.

PO5 - Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modelling, to complex engineering activities with an understanding of the limitations.
Relevance: Used many of the tremendous tools for the development process.

PO6 - The Engineer and Society: Apply reasoning informed by contextual knowledge to assess societal, health, safety, legal, and cultural issues and the consequent responsibilities relevant to professional engineering practice.
Relevance: It can be implemented in various real-world problems.

PO7 - Environment and Sustainability: Understand the impact of professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
Relevance: -------------

PO8 - Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
Relevance: Able to identify standard norms.

PO9 - Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
Relevance: It is individual/team work that solves problems through technology.

PO10 - Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
Relevance: Prepared and documented the summer internship report on the Technology Entrepreneurship Program.

PO11 - Project Management and Finance: Demonstrate knowledge and understanding of engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
Relevance: It is a one-year training process conducted by the Indian School of Business, with heavy costing.

PO12 - Life-long Learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
Relevance: It is an endless learning procedure because an entrepreneur should learn every day from everything.

PSO1 - Application Development
Relevance: An application that helps farmers.

PSO2 - Successful Career and Entrepreneurship
Relevance: ----