
4 Data Engineering

This document provides an overview of data engineering, highlighting the role of data engineers in preparing and optimizing data for analysis, as well as the tools and technologies they use, particularly in cloud computing. It contrasts the responsibilities of data engineers with those of data scientists and outlines the challenges faced in self-service analytics. Additionally, it offers resources for becoming a data engineer and lists current job postings in the field.

Uploaded by

rtzvdpsw2x
Copyright
© All Rights Reserved

An Overview of Data Engineering

 1 
Objectives
 In this lecture, you will:

 Be exposed to the world of data engineering.

 Explore the differences between a data engineer and a data scientist.

 Get an overview of the various tools data engineers use.

 Expand your understanding of how cloud technology plays a role in data engineering.

 2 
Outline
 The DATA problem
 What is data engineering?
 Tasks of the data engineer
 Data engineer or data scientist?
 Data engineering problems
 Tools of the data engineer
 Kinds of databases
 Processing tasks
 Scheduling tools
 Cloud providers
 Why cloud computing?
 Big players in cloud computing
 Cloud services

 3 
The DATA Problem
 Preparing data for analytics is hard.

 Data is often the biggest challenge of self-service analytics.

 Self-service analytics allows end users to easily analyze their data by building their own reports and modifying existing ones with little to no training.

 4 
The Data Problem in Self Service Analytics
 Half of the organizations are accessing external data sources.
 Data is scattered.

 5 
The Data Problem in Self Service Analytics
 Transactional systems are often not ready for analysis.
 Not optimized for analysis

 6 
The Data Problem in Self Service Analytics
 The database needs to be optimized so that it is:
 faster to query
 free of corrupt data

 7 
The DATA Problem Solution
 In comes the data engineer to the rescue.

 8 
What is Data Engineering?
 The data engineer is one of the most valuable people in a data-driven company that wants to scale up.
 It is the data engineer's task to make your life as a data scientist easier.
 The data engineer:
 Gathers data from different sources
 Optimizes databases for analysis
 Removes corrupt data

 The data engineer focuses on:
 processing and handling massive amounts of data
 setting up clusters of machines to do the computing

 9 
Tasks of a Data Engineer
 The tasks of a data engineer consist of:
 developing a scalable data architecture (schema)
 streamlining data acquisition
 setting up processes that bring data together from several sources
 safeguarding data quality by cleaning up corrupt data
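Two of the tasks above, bringing data together from several sources and cleaning up corrupt data, can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names (order_id, amount), and the rule for what counts as "corrupt" are all assumptions made for this example, not something the lecture prescribes.

```python
# Hypothetical records from two sources; the field names are illustrative only.
store_orders = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "2", "amount": "not-a-number"},  # corrupt amount
    {"order_id": "", "amount": "5.00"},           # missing identifier
]
partner_orders = [
    {"order_id": "3", "amount": "42.50"},
]


def is_clean(record: dict) -> bool:
    """Assumed rule: a usable record has an id and a numeric, non-negative amount."""
    if not record.get("order_id"):
        return False
    try:
        return float(record["amount"]) >= 0
    except (KeyError, ValueError):
        return False


def gather(*sources: list) -> list:
    """Bring several sources together, keeping only clean records."""
    combined = []
    for source in sources:
        for record in source:
            if is_clean(record):
                combined.append(
                    {"order_id": int(record["order_id"]),
                     "amount": float(record["amount"])}
                )
    return combined


clean_orders = gather(store_orders, partner_orders)
print(clean_orders)
```

In practice the sources would be databases, files, or APIs rather than in-memory lists, and the cleaning rules would come from the business context.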

 10 
Tasks of a Data Engineer
 Data engineers design, build, and maintain data architectures for large-scale applications.
 This career path requires strong software engineering skills.

 Essentially, a data engineer needs the skills to build a data pipeline that connects all the pieces of the data ecosystem together and keeps it up and running.

 Data engineering is the first, and arguably most crucial, step for a successful data strategy.
 Data engineers make sure data scientists have the data they need to perform data science.

 11 
Tasks of a Data Engineer
 They're responsible for:
 Accessing, collecting, auditing, and cleaning data from
applications and systems into a usable state.
 Creating and maintaining efficient databases.
 Building data pipelines.
 Monitoring and managing systems, including distributed
systems.

 12 
Tasks of a Data Engineer
 To emphasize just how important data engineering is for data
science, take a look at the following hierarchy of needs,
proposed by Monica Rogati.

 13 
Data Engineer vs Data Scientist
Data Engineer
 Develop scalable data architecture
 Streamline data acquisition
 Set up processes to bring together data
 Clean corrupt data
 Well versed in cloud technology

Data Scientist
 Mining data for patterns
 Statistical modeling
 Predictive models using machine learning
 Monitor business processes
 Clean outliers in data

 14 
Exercise 1
 There are some differences between the tasks of data scientists and the tasks of data engineers.

 Below are three essential tasks that need to happen in a data-driven company. Can you find the one that best fits the job of a data engineer?

 Apply a statistical model to a large dataset to find outliers.

 Set up scheduled ingestion of data from the application databases to an analytical database.

 Come up with a database schema for an application.

 15 
Exercise 2
 Classify each task by color: data engineer (red) or data scientist (blue).

 Cloud technology
 Mining data for patterns
 Monitor business processes
 Streamline data acquisition
 Clean statistical outliers in data
 Set up processes to bring together data
 Statistical modeling
 Develop scalable data architecture
 Predictive models using machine learning
 Clean corrupt data

 16 
Exercise 3
 Imagine you work in a medium-scale company that hosts an online market for pet toys. As the company is growing, there are unmistakably some technical growing pains.
 As the first data engineer, you observe some problems and have to decide where you're best suited to be of help.

 Data scientists are querying the online store databases directly and slowing down the functioning of the application since it's using the same database.

 Harmful product recommendations are affecting the sales numbers of the online store.

 The online store is slow because the application's database server doesn't have enough memory.

 17 
Tools of the Data Engineer: Database Systems
 Data engineers are expert users of database systems.

 A database is a computer system that holds large amounts of data.

 Applications rely on databases to provide certain functionality.

 Other databases are used for analyses.

 The data engineer's task begins and ends at databases.
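The distinction between a database that backs an application and one used for analyses can be sketched with Python's built-in sqlite3 module. The table names, columns, and sample rows below are hypothetical; two in-memory SQLite databases simply stand in for the separate systems a company would run.

```python
import sqlite3

# Application database: records individual orders as they happen.
app_db = sqlite3.connect(":memory:")
app_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
)
app_db.executemany(
    "INSERT INTO orders (product, amount) VALUES (?, ?)",
    [("ball", 4.0), ("rope", 6.5), ("ball", 4.0)],
)
app_db.commit()

# Analytical database: holds a summary shaped for reporting queries.
analytics_db = sqlite3.connect(":memory:")
analytics_db.execute(
    "CREATE TABLE sales_by_product "
    "(product TEXT PRIMARY KEY, order_count INTEGER, revenue REAL)"
)

# The data engineer's work starts at one database and ends at another.
summary = app_db.execute(
    "SELECT product, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY product ORDER BY product"
).fetchall()
analytics_db.executemany(
    "INSERT INTO sales_by_product VALUES (?, ?, ?)", summary
)
analytics_db.commit()

report = analytics_db.execute(
    "SELECT product, order_count, revenue FROM sales_by_product ORDER BY product"
).fetchall()
print(report)
```

Keeping the summary in its own database means reporting queries do not compete with the application for the same resources.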

 18 
Tools of the Data Engineer: Processing
 Tools for quickly processing data
 Clean data
 Aggregate data
 Join data
 Huge amounts of data have to be processed.
 That is where parallel processing comes into play.
 Data engineers use clusters of machines to process data.
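As a rough sketch of the ideas above, the following Python splits a small dataset into chunks, cleans and aggregates each chunk in a separate worker, and then joins the combined result with a lookup table. The data and names are made up for illustration, and a local thread pool is only a stand-in for a cluster of machines; real workloads would use a distributed processing framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: (customer_id, amount) events and a customer lookup table.
events = [(1, 10.0), (2, None), (1, 5.0), (3, 7.5), (2, 2.5), (3, None)]
customers = {1: "Ana", 2: "Ben", 3: "Cleo"}


def process_partition(partition):
    """Clean (drop missing amounts) and aggregate (sum per customer) one chunk."""
    totals = Counter()
    for customer_id, amount in partition:
        if amount is not None:
            totals[customer_id] += amount
    return totals


# Split the data into three chunks; each worker stands in for one machine.
partitions = [events[i::3] for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_totals = list(pool.map(process_partition, partitions))

# Combine the partial results, then join them with the customer names.
totals = Counter()
for partial in partial_totals:
    totals.update(partial)
joined = {customers[cid]: total for cid, total in sorted(totals.items())}
print(joined)
```

The same split, process, and combine shape carries over when the chunks live on different machines.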

 19 
Tools of the Data Engineer
 Scheduling tools help to make sure that:
 data moves from one place to another at the correct time, at specific intervals
 the dependency requirements of jobs are resolved
 jobs run in the right order
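Resolving job dependencies so that jobs run in the right order is essentially a topological sort. A minimal sketch using graphlib from the Python standard library (Python 3.9+) is shown below; the job names are hypothetical, and a real scheduling tool would also handle timing, retries, and monitoring.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each job maps to the set of jobs it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean_orders": {"extract_orders"},
    "join_orders_customers": {"clean_orders", "extract_customers"},
    "build_report": {"join_orders_customers"},
}

# static_order() yields every job after all of the jobs it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
for job in run_order:
    print("running", job)
```

Several valid orders can exist (the two extract jobs are independent of each other), so only the relative order of dependent jobs is guaranteed.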

 20 
A Data Pipeline

 21 
Exercise 4
 Can you identify the database in the schematics?

Accounting, Online Store, Product Catalog, and Analytics


 22 
Data Processing in the Cloud
 Data engineers are heavy users of the cloud.

 Data processing often runs on clusters of machines.

 Problem:
 self-host data-center
 Cover electrical and maintenance costs
 Peaks vs. quiet moments: hard to optimize

 Solution:
 use the cloud
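The "peaks vs. quiet moments" problem can be made concrete with a little arithmetic. The hourly demand figures below are invented purely for illustration: a self-hosted cluster has to be sized for the busiest hour, while on-demand resources are only used for the hours they are needed.

```python
# Hypothetical number of machines needed in each hour of one day.
hourly_demand = [2] * 8 + [5] * 8 + [20] * 2 + [5] * 6

peak = max(hourly_demand)

# Self-hosting: enough machines for the peak must be available all day.
self_hosted_machine_hours = peak * len(hourly_demand)

# On demand: machine-hours follow the actual demand.
on_demand_machine_hours = sum(hourly_demand)

# Share of the self-hosted capacity that is actually used.
utilization = on_demand_machine_hours / self_hosted_machine_hours

print(self_hosted_machine_hours, on_demand_machine_hours, utilization)
```

This compares machine-hours only, not money. The price per machine-hour can differ between the two setups, so the sketch does not show that the cloud is always cheaper.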

 23 
Data Storage in the Cloud
 Problem:
 self-host data-center
 Disaster will strike
 Need different geographical locations

 Solution:
 use the cloud

 24 
The Big Three Cloud Providers
 Amazon Web Services (AWS)
 Microsoft Azure
 Google Cloud

Provider      Market share in 2018  Storage        Computation       Databases
AWS           32%                   S3             EC2               RDS
Azure         17%                   Blob           Virtual Machines  SQL Database
Google Cloud  10%                   Cloud Storage  Compute Engine    Cloud SQL

 25 
Exercise 5
 You saw the benefits of using cloud computing as opposed to self-hosting data centers.
 Can you select the most correct statement about cloud computing?

 Cloud computing is always cheaper.

 The cloud can provide you with the resources you need, when you need them.

 On-premises machines give you full control over the situation when things break.

 26 
Resources for Becoming a Data Engineer
 Microsoft Certified: Azure Data Engineer Associate
 Azure Data Engineers design and implement the
management, monitoring, security, and privacy of data using
the full stack of Azure data services to satisfy business needs.

 27 
Resources for Becoming a Data Engineer
 Udacity Nanodegree Program: Data Engineer
 The course covers SQL, Spark, data warehousing on AWS, Apache Airflow, etc.

 28 
Resources for Becoming a Data Engineer
 Google Cloud Platform Certification: Professional Data
Engineer
 Data Engineering on Google Cloud Platform.
 Both the certification and training are short stints and go on to
teach you about using Hadoop, Google BigQuery, and
building scalable machine learning applications on GCP.

 29 
Resources for Becoming a Data Engineer
 Dataquest: Data Engineer
 This is a good course for someone beginning their journey into the data engineering landscape, but because of the course structure it helps to have at least some basic Python knowledge.
 The course begins with an introduction to Python and moves on to SQL, which develops further into learning how to use PostgreSQL and data structures and algorithms.

 30 
Resources for Becoming a Data Engineer
 UC San Diego: Big Data Specialization
 This specialization by the University of California San Diego on Coursera centers on using the Hadoop framework and Spark, and on applying these big data handling techniques in a machine learning instance at the end.

 31 
Resources for Becoming a Data Engineer
 AWS Certified Big Data - Specialty
 Because this certification is for advanced users, it requires a few years' experience using AWS and other certifications such as AWS Certified Cloud Practitioner.

 32 
Current DE Job Posts
 https://twitter.com/MTCyberStaffing

 https://referrals.ibm.com/ibmreppify/jobs/214481?bpin=1046&com=2&rin=583440

 https://redolentech.catsone.com/careers/5101-General/jobs/13183549-Data-Engineer-SparkFlinkScala-Engineer/

 https://www.indeed.com/viewjob?jk=9f7f3b843332d2d3&from=employertweet

 https://remotepad.io/remote-jobs/1011-remote-data-engineer

 http://www.recruiter-directory.info/jobs/?q=Big+Data+Engineer&l=Whippany+NJ+USA&z=&tw=&k=

 https://jobs.findyourflex.co.uk/job/big-data-engineer/?utm_source=twitter&utm_medium=social&utm_campaign=Lloyds_Feb_20&utm_content=joboftheweek

 33 
Current DE Job Posts
 https://www.modis.com/en-au/jobs/job/north-sydney/big-data-engineer-/BROADBEAN_564281581389373/

 https://provenstaffingsolutions.catsone.com/careers/11273-General/jobs/12896092-Data-Engineer--Streaming-Platform-Integrator/

 https://provenstaffingsolutions.catsone.com/careers/11273-General/jobs/12896098-Data-Engineer--Batch-Capability-Specialist/

 https://idexx.wd1.myworkdayjobs.com/IDEXX/job/Westbrook-ME/Backend-Data-Developer_J-009005?shared_id=d653a212-173b-4964-a0d6-83e59d470226

 https://connect.cousant.com/jobs/data-engineer/

 https://xing-se.jobbase.io/job/b5hqecpf?utm_source=Twitter&utm_campaign=twitter_blog&utm_medium=social

 https://socialmedialab.ca/2020/02/11/big-data-engineer-and-spark-developer/

 34 
