4 Data Engineering
Objectives
In this lecture, you will be
Outline
The DATA problem
Kinds of databases
The DATA Problem
Preparing Data for Analytics is Hard.
The Data Problem in Self Service Analytics
Half of the organizations are accessing external data sources.
Data is scattered.
The Data Problem in Self Service Analytics
Transactional systems are often not ready for analysis.
Not optimized for analysis
The Data Problem in Self Service Analytics
The database needs to be optimized so that it becomes:
faster to query
free of corrupt data
The DATA Problem Solution
In comes the data engineer to the rescue.
What is Data Engineering?
The data engineer is one of the most valuable people in a
data-driven company that wants to scale up.
It is the data engineer's task to make your life as a data scientist
easier.
The data engineer:
Gathers data from different sources
Optimizes the database for analysis
Removes corrupt data
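As a rough illustration of gathering data from different sources and removing corrupt records, here is a minimal Python sketch. The two sources (a CSV export and a JSON feed), their field names, and the rule for what counts as "corrupt" are all assumptions made up for this example; they do not come from the lecture.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export from a web shop.
CSV_EXPORT = """order_id,customer,amount
1,alice,19.99
2,bob,not_a_number
3,,5.00
"""

# Hypothetical source 2: a JSON feed from a mobile app.
JSON_FEED = (
    '[{"order_id": 4, "customer": "carol", "amount": 7.5},'
    ' {"order_id": 5, "customer": "dave", "amount": null}]'
)


def gather():
    """Read both sources and return their records as one list of dicts."""
    records = list(csv.DictReader(io.StringIO(CSV_EXPORT)))
    records.extend(json.loads(JSON_FEED))
    return records


def clean(records):
    """Keep only records that have a customer and a numeric amount."""
    good = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except (TypeError, ValueError):
            continue  # corrupt or missing amount
        if not rec["customer"]:
            continue  # missing customer name
        good.append({
            "order_id": int(rec["order_id"]),
            "customer": rec["customer"],
            "amount": amount,
        })
    return good


cleaned = clean(gather())
print(cleaned)
```

Only the records that pass both checks survive, and they all end up in one common shape regardless of which source they came from.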
Tasks of a Data Engineer
The tasks of a data engineer include:
Developing a scalable data architecture (schema)
Tasks of a Data Engineer
Data engineers design, build, and maintain data architectures
for large-scale applications.
This career path requires strong software engineering skills.
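To make "data architecture (schema)" a little more concrete, the sketch below defines a tiny schema with Python's built-in sqlite3 module. The table and column names are hypothetical, and SQLite is only a stand-in for whatever database system a team actually uses. The constraints keep corrupt rows out, and the index is the kind of optimization that makes lookups by customer faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Hypothetical schema: customers and their orders.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL CHECK (amount >= 0)
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 19.99)")


def try_insert(sql):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# An order for a customer that does not exist violates the foreign key.
accepted_unknown_customer = try_insert("INSERT INTO orders VALUES (11, 99, 5.0)")
# A negative amount violates the CHECK constraint.
accepted_negative_amount = try_insert("INSERT INTO orders VALUES (12, 1, -3.0)")

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(accepted_unknown_customer, accepted_negative_amount, order_count)

# Ask SQLite how it would run a lookup by customer; the plan text is
# expected to mention the index, though the exact wording varies by version.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 1"):
    print(row[-1])
```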
Tasks of a Data Engineer
They're responsible for:
Accessing, collecting, auditing, and cleaning data from
applications and systems into a usable state.
Creating and maintaining efficient databases.
Building data pipelines.
Monitoring and managing systems, including distributed
systems.
Tasks of a Data Engineer
To emphasize just how important data engineering is for data
science, take a look at the following hierarchy of needs,
proposed by Monica Rogati.
Data Engineer vs Data Scientist
Data Engineer Data Scientist
Exercise 1
There are some differences between the tasks of data
scientists and the tasks of data engineers.
Exercise 2
Classify each task by color: data engineer (red) or data
scientist (blue).
Cloud technology
Mining data for patterns
Monitor business processes
Streamline data acquisition
Clean statistical outliers in data
Set up processes to bring together data
Statistical modeling
Develop scalable data architecture
Predictive models using machine learning
Clean corrupt data
Exercise 3
Imagine you work in a medium-scale company that hosts an
online market for pet toys. As the company is growing, there
are unmistakably some technical growing pains.
As the first data engineer, you observe some problems and
have to decide where you're best suited to be of help.
Tools of the Data Engineer: Database Systems
Data engineers are expert users of database systems.
Tools of the Data Engineer: Processing
Tools for quickly processing data
Clean data
Aggregate data
Join data
Huge amounts of data have to be processed.
That is where parallel processing comes into play.
Data engineers use clusters of machines to process data.
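The clean, aggregate, and join steps, and the idea of splitting work across workers, can be sketched in plain Python. This is only a toy: a thread pool on one machine stands in for a cluster, and the sales records and price table are invented for the example. On a real cluster a framework such as Spark would distribute the chunks across machines, but the split, process, and combine pattern is similar.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales records: (product, quantity). None marks a corrupt quantity.
SALES = [("ball", 2), ("rope", 1), ("ball", None),
         ("bone", 4), ("rope", 3), ("ball", 5)]
# Hypothetical lookup table to join against.
PRICES = {"ball": 3.0, "rope": 2.5, "bone": 1.0}


def process_chunk(chunk):
    """Clean and aggregate one chunk: total quantity per product."""
    totals = Counter()
    for product, qty in chunk:
        if qty is None:
            continue  # clean: skip corrupt rows
        totals[product] += qty  # aggregate
    return totals


def split(rows, n_chunks):
    """Split rows into at most n_chunks slices of roughly equal size."""
    size = -(-len(rows) // n_chunks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Each worker handles one chunk independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process_chunk, split(SALES, 3)))

# Combine the partial results, then join with the price table.
quantities = sum(partials, Counter())
revenue = {product: qty * PRICES[product] for product, qty in quantities.items()}
print(revenue)
```

Because each chunk is processed without looking at the others, adding more workers (or machines) lets the same code handle more data.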
Tools of the Data Engineer
Scheduling tools help to make sure that:
data moves from one place to another at the correct time, at
specific intervals
dependency requirements of jobs are resolved
jobs run in the right order
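Resolving job dependencies so that jobs run in the right order amounts to a topological sort of the dependency graph. The sketch below uses graphlib from the Python standard library (Python 3.9 or later); the job names are hypothetical. It covers only the ordering part: a real scheduling tool also handles timing, intervals, and retries.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each mapped to the set of jobs it depends on.
JOBS = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "join_tables": {"clean_orders", "extract_customers"},
    "build_report": {"join_tables"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(JOBS).static_order())

for job in run_order:
    print("running", job)
```

Several valid orders exist here (the two extract jobs are independent), so the exact sequence may differ, but a job never runs before the jobs it depends on.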
A Data Pipeline
Exercise 4
Can you identify the database in the schematics?
Problem:
self-host data-center
Cover electrical and maintenance costs
Peaks vs. quiet moments: hard to optimize
Solution:
use the cloud
Data Storage in the Cloud
Problem:
self-host data-center
Disaster will strike
Need different geographical locations
Solution:
use the cloud
The Big Three Cloud Providers
Amazon Web Services (AWS)
Microsoft Azure
Google Cloud
Provider       Market share in 2018   Storage         Computation        Databases
AWS            32%                    S3              EC2                RDS
Azure          17%                    Blob            Virtual Machines   SQL Database
Google Cloud   10%                    Cloud Storage   Compute Engine     Cloud SQL
Exercise 5
You saw the benefits of using cloud computing as opposed
to self-hosting data centers.
Can you select the most correct statement about cloud
computing?
The cloud can provide you with the resources you need, when
you need them.
Resources to Becoming a Data Engineer
Microsoft Certified: Azure Data Engineer Associate
Azure Data Engineers design and implement the
management, monitoring, security, and privacy of data using
the full stack of Azure data services to satisfy business needs.
Resources to Becoming a Data Engineer
Udacity Nanodegree Program: Data Engineer
The course covers SQL, Spark, data warehousing on AWS, and
Apache Airflow, among other topics.
Resources to Becoming a Data Engineer
Google Cloud Platform Certification: Professional Data
Engineer
Data Engineering on Google Cloud Platform.
Both the certification and the training are short. They teach
you to use Hadoop and Google BigQuery and to build scalable
machine learning applications on GCP.
Resources to Becoming a Data Engineer
Dataquest: Data Engineer
This is a good course for someone beginning their journey
into data engineering, although the course structure makes
some basic Python knowledge useful.
The course begins with an introduction to Python, moves on
to SQL and PostgreSQL, and then covers data structures and
algorithms.
Resources to Becoming a Data Engineer
UC San Diego: Big Data Specialization
This University of California San Diego specialization on
Coursera centers on the Hadoop framework and Spark, and
ends by applying these big data techniques to a machine
learning problem.
Resources to Becoming a Data Engineer
AWS Certified Big Data - Specialty
Because this certification is for advanced users, it requires
a few years of experience with AWS as well as other
certifications, such as AWS Certified Cloud Practitioner.
Current DE Job Posts
https://twitter.com/MTCyberStaffing
https://referrals.ibm.com/ibmreppify/jobs/214481?bpin=1046&com=2&rin=583440
https://redolentech.catsone.com/careers/5101-General/jobs/13183549-Data-Engineer-SparkFlinkScala-Engineer/
https://www.indeed.com/viewjob?jk=9f7f3b843332d2d3&from=employertweet
https://remotepad.io/remote-jobs/1011-remote-data-engineer
http://www.recruiter-directory.info/jobs/?q=Big+Data+Engineer&l=Whippany+NJ+USA&z=&tw=&k=
https://jobs.findyourflex.co.uk/job/big-data-engineer/?utm_source=twitter&utm_medium=social&utm_campaign=Lloyds_Feb_20&utm_content=joboftheweek
Current DE Job Posts
https://www.modis.com/en-au/jobs/job/north-sydney/big-data-engineer-/BROADBEAN_564281581389373/
https://provenstaffingsolutions.catsone.com/careers/11273-General/jobs/12896092-Data-Engineer--Streaming-Platform-Integrator/
https://provenstaffingsolutions.catsone.com/careers/11273-General/jobs/12896098-Data-Engineer--Batch-Capability-Specialist/
https://idexx.wd1.myworkdayjobs.com/IDEXX/job/Westbrook-ME/Backend-Data-Developer_J-009005?shared_id=d653a212-173b-4964-a0d6-83e59d470226
https://connect.cousant.com/jobs/data-engineer/
https://xing-se.jobbase.io/job/b5hqecpf?utm_source=Twitter&utm_campaign=twitter_blog&utm_medium=social
https://socialmedialab.ca/2020/02/11/big-data-engineer-and-spark-developer/