Lecture 1 - Introduction


Introduction to Data Engineering

Lecture 1

Rokan Uddin Faruqui

Associate Professor
Dept of Computer Science and Engineering
University of Chittagong, Bangladesh
Email: [email protected]
Outline

1 Data Engineer vs Data Scientist

Rokan Uddin Faruqui CSE, CU Introduction to Data Engineering 2




Data Engineer vs Data Scientist

Data Engineer

Data engineers are the link between management’s data strategy and the data scientists who need to work with the data.

They build the platforms that enable data scientists to do their magic.


These platforms are usually used in four different ways:

1 Data ingestion and storage of large amounts of data
2 Algorithm creation by data scientists
3 Automation of the data scientist’s machine learning models and
algorithms for production use
4 Data visualization for employees and customers

Most of the time, data engineers start out as traditional solution architects for systems that involve SQL databases, web servers, SAP installations and other "standard" systems.


But to create big data platforms, the engineer needs to be an expert in specifying, setting up and maintaining big data technologies such as Hadoop, Spark, HBase, Cassandra, MongoDB, Kafka, Redis and more.

They also need experience in deploying systems on cloud infrastructure, such as Amazon's or Google's, or on on-premise hardware.


Data Scientist

Data scientists use linear algebra and multivariable calculus to create new insight from existing data.


Data Scientist: Case Study

An industrial company produces a lot of products that need to be tested before shipping.

Such tests usually take a lot of time because there are hundreds of things to check, all to make sure that the product is not broken.

Wouldn’t it be great to know early that a test will fail ten steps down the line? If you knew that, you could skip the remaining tests and just scrap the product or repair it.

That’s exactly where a data scientist can help. This field is called predictive analytics, and the technique of choice is machine learning.
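The idea of skipping the remaining tests can be sketched as a simple loop. This is only an illustration: the test steps, their readings and the `predictor` function are hypothetical stand-ins, and in practice the predictor would be a trained machine learning model rather than a fixed rule.

```python
def run_tests(steps, predicts_later_failure):
    """Run test steps in order; stop once a later failure is predicted."""
    results = []
    for name, measure in steps:
        results.append((name, measure()))
        if predicts_later_failure(results):
            return results, "stopped early: repair or scrap"
    return results, "all tests run"


# Hypothetical test steps, each returning a measured value.
steps = [
    ("voltage", lambda: 4.2),
    ("current", lambda: 0.9),
    ("temperature", lambda: 71.0),
]


def predictor(results):
    """Stand-in for a trained model: flags products with a low voltage reading."""
    readings = dict(results)
    return readings.get("voltage", 5.0) < 4.5


results, status = run_tests(steps, predictor)
print(status, "after", len(results), "of", len(steps), "tests")
```

Here the first reading already triggers the prediction, so the two remaining tests are never run.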


You feed an algorithm with measurement data. It generates a model and optimises it based on that data. The model essentially represents a pattern in your data. When you show the model new data, it tells you whether the new data still matches the data it was trained on.

The same technique can be used to predict machine failure in advance. Of course, the whole process is not that simple. Training and applying a model is not the hard part; much of the data scientist’s work lies in figuring out how to pre-process the data that gets fed to the algorithms.
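A minimal sketch of this train-then-apply cycle, using only the Python standard library. The "model" here is deliberately simple (a mean and standard deviation per measurement) and stands in for a real machine learning algorithm. The measurement values, the `max_z` threshold and all function names are assumptions made for illustration.

```python
from statistics import mean, stdev


def preprocess(rows):
    """Pre-processing step: drop rows that contain missing measurements."""
    return [row for row in rows if None not in row]


def fit(rows):
    """Learn the 'pattern': mean and standard deviation of each column."""
    return [(mean(col), stdev(col)) for col in zip(*rows)]


def resembles_training_data(model, row, max_z=3.0):
    """True if every value lies within max_z standard deviations of its mean."""
    return all(abs(x - mu) <= max_z * sigma for x, (mu, sigma) in zip(row, model))


# Hypothetical measurements: (temperature in C, vibration in mm/s)
raw = [(70.1, 1.2), (69.8, 1.1), (None, 1.4), (70.4, 1.3), (70.0, 1.2), (69.7, 1.0)]

model = fit(preprocess(raw))

print(resembles_training_data(model, (70.2, 1.15)))  # close to the training data
print(resembles_training_data(model, (85.0, 3.4)))   # far from the training data
```

Even in this toy version, the pre-processing step matters: without it, the row with the missing temperature would make the training step fail.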


[Figure: Machine Learning Pipeline]


Data Scientist: Case Study

To train an algorithm you need useful data. If you train on arbitrary data, the resulting model will be very unreliable.

An unreliable model for predicting machine failure would tell you that your machine is damaged even when it is not. Or, even worse, it would tell you the machine is OK even when there is a malfunction.

Model outputs are very abstract. You also need to post-process them to obtain the outputs you actually want.
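The two failure modes and the post-processing step can be illustrated with a short sketch. It assumes the model emits a score between 0 and 1 for each machine; the scores, the true states and the thresholds below are made-up values, not results from a real model.

```python
def to_decision(score, threshold=0.5):
    """Post-process an abstract model score into a readable label."""
    return "damaged" if score >= threshold else "ok"


def count_errors(scores, actual, threshold=0.5):
    """Return (false alarms, missed failures) for the given threshold."""
    false_alarms = missed = 0
    for score, truth in zip(scores, actual):
        predicted = to_decision(score, threshold)
        if predicted == "damaged" and truth == "ok":
            false_alarms += 1  # model says damaged, machine is fine
        elif predicted == "ok" and truth == "damaged":
            missed += 1  # model says ok, machine is malfunctioning
    return false_alarms, missed


# Hypothetical model scores and the true state of each machine.
scores = [0.10, 0.80, 0.65, 0.30, 0.95, 0.40]
actual = ["ok", "damaged", "ok", "ok", "damaged", "damaged"]

print(count_errors(scores, actual, threshold=0.5))
print(count_errors(scores, actual, threshold=0.35))
```

Lowering the threshold in this example removes the missed failure while keeping the single false alarm. With real data, the two error types usually have to be traded off against each other.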


[Figure: Machine Learning Workflow]


[Figure: Machine Learning Model]


[Figure: Data Science Platform]
