
MIT data engineering

MIT data engineering Curriculum

Uploaded by

Alok Tiwary
Copyright
© All Rights Reserved

PROFESSIONAL CERTIFICATE

IN DATA ENGINEERING
Gain cutting-edge skills to advance your data engineering career.

Delivered in collaboration with Emeritus


Overview
As the use of technology expands, data engineering is becoming increasingly vital, and the demand for the specialized
expertise of data engineers is growing. In fact, according to the Dice Tech Job Report, data engineering was the fastest
growing tech occupation in 2020. Why? Because before data scientists can gain useful information from the mountains of
data possessed by today's organizations, the data must be configured, warehoused, and made accessible, and data
engineers are responsible for building the infrastructure.

The MIT xPRO Professional Certificate in Data Engineering is an immersive 6-month program that’s designed to provide you
with job-ready, in-demand data engineering skills and a competitive edge in the marketplace. Through an exploration of
core concepts, tools, techniques, and best practices, participants will learn data engineering essentials, from building
effective data architectures and data warehouses to designing data models, streamlining data processing, automating
data pipelines, data wrangling, and big data engineering. Participants also receive personalized feedback, live office hours
with learning facilitators, and the opportunity to develop a GitHub portfolio for potential employers.

MIT xPRO’s online learning programs showcase industry-aligned content from world-renowned experts to make learning
accessible anytime, anywhere, and solve challenges for developing technical professionals.

Take the next step to launch your data engineering career.

Price: USD 7,800

Duration: 6 months, online; 15–20 hours per week

$128,631: The average annual pay for a data engineer in the United States. (Source: Indeed, 2023)

“Data engineers build the ‘nervous system’ of the company. Without it, the company cannot react to changes in
the external business environment or within the organization. They build the software and hardware systems
that power the company’s vision and are masters not just of software, but of hardware, networks, and analytic
apps that are changing everyday data.”

– John R. Williams, Professor of Information Engineering, MIT


Program Highlights
Earn a certificate and 36 continuing education units (CEUs) from MIT xPRO

Gain insights with coding demos from renowned MIT faculty

Learn market-ready data engineering skills for use in a high-growth market

Build a GitHub portfolio of your projects to share with potential employers


Program Experience

20+ hours of prerecorded MIT faculty videos

5+ hours of live mentorship and career support*

8 hours of optional career development activities*

19 career development video lectures covering 30 career topics*

Sample Weekly Program Planner


Participants should expect to dedicate a minimum of 15–20 hours per week to the program.

1 hour of recorded video lectures from faculty

Live interaction with program learning facilitators**

2 hours of self-study and practice exercises

2 hours of group discussions with peers to exchange and generate ideas

13 hours of rigorous, graded assignments to apply and reinforce the lecture material

*Services provided by Emeritus, a learning partner for this program.


**The schedule for live interactions is subject to change based on availability and will be confirmed once
the program starts.
Learning Journey
From navigating the enrollment process to identifying job opportunities, we partner with you to take the next
step in your career.

LEARNING FACILITATOR
Your learning facilitator will leverage their industry experience and expertise to guide you by holding live
sessions, providing assignment feedback, and answering questions.

LEARNING COMMUNITY
Your learning community will provide an interactive environment where you can learn with a group of like-minded
individuals and build a global network of peers.

PROGRAM ADVISOR
Your program advisor will be your enrollment resource, answering any pre-program questions and easing your
transition into the program.

CAREER COACH
Your career coach will help you successfully navigate your job search by assisting with goal setting, providing
feedback on your cover letter, résumé, and LinkedIn profile, and conducting mock interviews. They will be a source
of up-to-the-minute information on hiring trends and help celebrate the next step in your career.
Tools and Resources in the Program
The Professional Certificate in Data Engineering program employs the latest industry tools and resources, including:

· Hadoop · MySQL Driver · Wget
· Docker · MySQL Shell · PySpark
· Cassandra · Git · Airflow
· Google Colab · Postman · Mosquitto
· Jupyter Notebook · Strapi · ThingsBoard
· Kafka · Flask Web Server · Web Sockets
· GitHub · Java · Node.js
· MySQL · Nano Text Editor · OAuth2
· Spark · Debezium · Okta
· Firebase · Spring Boot · MongoDB
· Swagger · Curl · Redis
· MySQL Workbench · Mapbox · JSON Web Tokens
· Visual Studio Code · Maven · JavaScript
· MySQL Python Connector · NiFi · YAML

Python libraries you will work with:

· NumPy · Feather
· pandas · Lorem
· DASK · Node-rdkafka
· SciPy · Graphviz
· TensorFlow · Data-Driven Documents (D3)
· Matplotlib · NotebookJS
· Seaborn · NLTK
· Scikit-learn · Paho
· OpenSSL · Express
· Gym · Kafka-Python
· Arrow

“Data engineering really is a core component of today’s data infrastructure. And because organizations can’t
function without data, it’s also a career with a great deal of opportunity and incredibly interesting work
as well.”
– Abel Sanchez, Research Scientist and Executive Director,
MIT’s Geospatial Data Center
Who Is This Program For?

Career launchers: Recent STEM graduates/post-graduates/interns looking to start a career in this high-growth field
by gaining exposure to data engineering.

Career builders: Early career software engineers/technology professionals seeking training in the latest data
engineering tools and techniques to advance their careers.

Career switchers: Mid-career professionals aiming to switch to data engineering from information technology,
analytics, finance, project management, supply chain, or other technical fields.

Applicants must have:
• A bachelor's degree or higher
• Strong math skills
• Some experience with Python, R, or SQL
• Some experience with statistics and calculus

Also recommended:
• An educational background in STEM fields
• Technical work experience

Prepare for these potential job titles:

Data Engineer Data Science Engineer Data Integration Engineer


Big Data Engineer Data Platform Engineer Data Infrastructure Engineer
Data Software Engineer Python Data Engineer Data Systems Engineer
Data Analytics Engineer Cloud Data Engineer Business Intelligence Engineer
Data Warehouse Engineer

*Participants must be 18 or above to apply for this program.


Key Takeaways
This program is designed to give you the skills you need to start or continue your career in
data engineering. High-level learning outcomes for this program include:

Develop and analyze databases using data science and data engineering tools and skills, including SQL and Python
• Use multiple Python libraries including NumPy, pandas, DASK, SciPy, TensorFlow, Matplotlib,
Seaborn, Scikit-learn, OpenSSL, Gym, Arrow, Feather, Lorem, Node-rdkafka, Graphviz, Data-Driven
Documents (D3), NotebookJS, NLTK, Paho, Express, and Kafka-Python
• Design databases conceptually and formally
• Perform extract, transform, and load (ETL) on a dataset
• Perform change data capture (CDC)
• Develop a web application in Java
• Connect a database to Debezium
• Create an application using web tokens
• Build a transit data application using Mapbox and Maven
• Use NiFi to create an ETL pipeline
• Utilize Hadoop to handle big data
• Use Docker to create and manipulate Spark images and containers
• Use PySpark to query data
• Create a workflow in Airflow
• Learn database containerization, how to use containers when working with databases, and how to
run queries to interface with a database container
• Learn data visualization

Configure a network to ensure data security


• Identify the key concepts of security, encryption, and authentication
• Develop a web application in Java
• Define web token architecture and create an application using web tokens
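
The web token architecture named above can be sketched in a few lines of standard-library Python. This is a minimal illustration of how an HS256-signed JSON Web Token is assembled and checked, not the program's own material: the function names `sign_token` and `verify_token`, the sample payload, and the secret are all hypothetical, and the sketch deliberately skips the expiry and algorithm-header checks that a production library would enforce.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def sign_token(payload: dict, secret: bytes) -> str:
    """Build header.payload.signature using HMAC-SHA256 (HS256)."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        _b64url(json.dumps(header, separators=(",", ":")).encode("utf-8")),
        _b64url(json.dumps(payload, separators=(",", ":")).encode("utf-8")),
    ]
    signing_input = ".".join(segments).encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return ".".join(segments + [_b64url(signature)])


def verify_token(token: str, secret: bytes) -> dict:
    """Return the payload if the signature matches, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("token must have three segments")
    header_b64, payload_b64, signature_b64 = parts
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # compare_digest avoids leaking timing information about the signature
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    return json.loads(_b64url_decode(payload_b64))
```

A server would typically call `sign_token` after a successful login and `verify_token` on each later request; a token signed with a different secret is rejected with `ValueError`.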

Implement artificial intelligence (AI)/machine learning (ML) algorithms, including those for reinforcement
learning and deep neural networks
• Learn the fundamental concepts of reinforcement learning, including the reward matrix, the quality
matrix, the Bellman equation, and deep neural networks
• Apply gradient descent to reduce error
• Implement the Naïve Bayes and Gaussian Naïve Bayes classifiers and k-means clustering using Scikit-learn
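
To make the reinforcement learning terms above concrete, here is a minimal sketch of tabular Q-learning in plain Python. It shows how a quality matrix is filled in by repeatedly applying a Bellman-style update to observed rewards. The five-state corridor, the reward of 100 at the goal, the discount and learning-rate values, and the names `step`, `train`, and `greedy_policy` are illustrative assumptions rather than program content.

```python
import random

# Hypothetical toy environment: a corridor of 5 states, goal at state 4.
# Action 0 moves left, action 1 moves right; walls clamp the position.
N_STATES = 5
GOAL = 4
GAMMA = 0.8  # discount factor (assumed value)
ALPHA = 0.5  # learning rate (assumed value)


def step(state, action):
    """Return (next_state, reward); only entering the goal is rewarded."""
    if action == 0:
        nxt = max(0, state - 1)
    else:
        nxt = min(N_STATES - 1, state + 1)
    reward = 100.0 if nxt == GOAL else 0.0
    return nxt, reward


def train(episodes=500, seed=0):
    """Fill the quality matrix Q[state][action] by random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES - 1)  # start anywhere but the goal
        while state != GOAL:
            action = rng.randrange(2)
            nxt, reward = step(state, action)
            # Bellman update: move Q toward reward + discounted best future value
            target = reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every non-goal state."""
    return [max((0, 1), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

After training, `greedy_policy(train())` selects "move right" in every non-goal state, which is the shortest route to the goal in this toy corridor.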

Manage big data using data warehousing and workflow management platforms
• Run parallel operations in DASK
• Stream data through web sockets
• Identify the key concepts related to visualization, unstructured data, and JavaScript
• Create a sensemaking data pipeline

Build a user interface to view and interact with large amounts of live streaming data
• Discuss use cases for Mosquitto
• Stream live data to ThingsBoard
• Analyze live streaming data using ThingsBoard
• Construct a web server that consumes messages from Kafka

Create a GitHub portfolio to present the projects that you create to potential employers
Program Learning Objectives
• Explain key data science and data engineering concepts

• Develop and analyze databases using data science and data engineering tools and skills, including SQL and Python

• Configure a network to ensure data security

• Manage big data using data warehousing and workflow management platforms

• Implement AI/ML algorithms, including those for reinforcement learning and deep neural networks

• Build a user interface to view and interact with large amounts of live streaming data

Program Modules
1. Introduction to Python
2. Python: Introduction to NumPy
3. Python: pandas
4. Databases: SQL
5. Databases: Basic SQL Statements
6. Database Analysis and the Client–Server Interface
7. A Model to Predict Housing Prices
8. ETL, Analysis, and Visualization
9. GitHub and Advanced Python Functions
10. Software Engineering Basics
11. Basics of Client–Server Architecture
12. Types of Databases and Database Containerization
13. CDC
14. Java and Debezium
15. Using Advanced Python Programming to Create Web Applications
16. Transit Data and Application Programming Interfaces (APIs)
17. Performing ETL Using NiFi
18. Platforms for Handling Big Data
19. Processing Big Data with Spark and Airflow
20. Introduction to ML and Advanced Probability
21. Introduction to Reinforcement Learning and Deep Neural Networks
22. Processing and Streaming Big Data
23. Creating a Data Pipeline
24. Handling Big Data with Mosquitto, ThingsBoard, and Kafka
Program Schedule
This program is organized into three main sections:

Section 1
In the first section of the program, you will learn the basics of the Python programming language, how
to work with relational databases using SQL, and how to work with Python to create databases and
server pipelines.
Modules 1–3: Python, NumPy, Matplotlib, and pandas
• You will work with multiple cutting-edge Python libraries including NumPy, Matplotlib, and pandas.
• You will use basic data types and advanced structures in Python, such as lists, tuples, sets,
and dictionaries.
Modules 4–6: Relational Databases and SQL
• You will write complex database queries, use Regular Expressions, clean a database, define drivers to
read a table, write files to your database, and write YAML files.
Modules 7 and 8: Portfolio Projects
• You will build a predictive model using linear regression.
• You will use ETL to analyze a dataset and then visualize the results using Matplotlib.
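
As a rough illustration of the extract, transform, and load steps mentioned above, the sketch below reads a small CSV, cleans it, and loads it into an in-memory SQLite database using only the Python standard library. The sample rows, the `homes` table, and the function names are hypothetical and chosen only for demonstration; the program's own projects may use different data and tools.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; the second row is missing its price on purpose.
RAW_CSV = """city,price_usd,sqft
Boston,500000,1000
Austin,,1200
Denver,360000,900
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a price and derive price per square foot."""
    cleaned = []
    for row in rows:
        if not row["price_usd"]:
            continue
        price = float(row["price_usd"])
        sqft = float(row["sqft"])
        cleaned.append((row["city"], price, price / sqft))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS homes "
        "(city TEXT, price_usd REAL, price_per_sqft REAL)"
    )
    conn.executemany("INSERT INTO homes VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run all three steps against an in-memory database and return it."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling `run_etl()` returns a connection that can be queried with ordinary SQL, for example `SELECT city, price_per_sqft FROM homes ORDER BY city`.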

Section 2
In the second section of the program, you will learn more advanced Python functions and create a
GitHub portfolio to present your projects to potential employers. Then, you will dive deeper into various
command line and data security tools. You will work on tasks such as database containerization, CDC,
and data wrangling.
Modules 9–11: GitHub, Docker, Visual Studio Code, and Flask
• You will learn more advanced Python constructs, such as classes, wrappers, and decorators. You will
work with a number of software engineering tools, including Postman, Docker, Flask, Bootstrap,
cookies, and security tools. You will practice using command line commands, asynchronous
event-driven programming, HTTP structure, and creating APIs.
Modules 12–14: Database Containerization, CDC, Java, and Debezium
• You will learn how to use containers when working with databases and how to run queries to
interface with a database container. You will manipulate data and perform CDC in different types
of databases including MongoDB, Cassandra, Redis, and Firebase. You will practice the basics of the
Java programming language and use Debezium to perform CDC on containers.
Modules 15 and 16: Portfolio Projects
• You will create a web application and use JavaScript Object Notation (JSON) web tokens,
authentication, and authorization to create security features. You will also manipulate a database
using Python redundant dictionaries.
• You will use Mapbox and Maven to build a transit data application.
Section 3
In the third section of the program, you will explore the tools that are used to manage big data and data
warehousing. You will learn how ML, reinforcement learning, advanced probability, and deep neural networks
are integrated into data engineering.
Modules 17–19
• You will use NiFi to construct an ETL pipeline and work with Hadoop, Spark, and Airflow to create data
pipelines for big data processing. You will use PySpark to query big data.
Modules 20–22
• You will practice the mathematical foundations of ML algorithms and apply them with the Python Scikit-learn
library. You will stream big data using the pandas, Parquet, and Feather libraries. You will use the
DASK library to create, read, write, and analyze multiple files in parallel and simulate parallel
processing across distributed machines.
Modules 23 and 24: Portfolio Projects
• You will clean data from web pages and use JavaScript, the Document Object Model and HyperText
Markup Language to create a sensemaking data pipeline. You will visualize your data with the
JavaScript D3 library.
• You will use the MQTT protocol to produce temperature and humidity data and publish data to
ThingsBoard. You will use Kafka to create a Python application that publishes vehicles’ location data
to a Kafka topic. Finally, you will use Node.js to construct a web server that acts as a consumer for
the messages received from the Kafka broker.

Note: Break weeks are included to cover project assignment work and prepare for upcoming modules.
Assignments and Portfolio Projects
Each module includes engaging assignments and culminates in at least one GitHub portfolio project that you’ll
complete based on what you have learned in that portion of the program.

Assignments

Peer discussions Interactive activities Practice exercises Knowledge checks

Coding Exercises

Coding exercises are integrated into various modules through simple activities using Jupyter Notebook. They
allow you to practice building composite skills to prepare you for the assignments and portfolio projects.

Portfolio Projects

Build a predictive ML model involving feature selection for linear regression

Build a reinforcement learning model for robot navigation (from scratch in Python)

Run TensorFlow for a deep neural network model (Deep Dream in Colab)

Build a publisher/subscriber broker for visualizing streaming MQTT sensor data

Stream load 100 million lines of data and create and write 20 files in parallel using DASK

Protect your web server using JSON web tokens

You will receive personalized feedback from your program leaders on your GitHub repositories, so you finish with a
market-ready portfolio to share with potential employers.
Program Faculty

John R. Williams
Professor of Information Engineering, MIT Department of Civil and Environmental Engineering

Dr. John R. Williams holds a B.A. in physics from Oxford University, an M.S. in physics from UCLA, and a Ph.D. in
numerical methods from the University of Wales, Swansea. His research focuses on the application of large-scale
computation to problems in cyber-physical security and energy studies. He is director of MIT’s Geospatial Data
Center, and, from 2006 to 2012, was director of the MIT Auto-ID Laboratory, which invented the Internet of
Things (IoT).

Dr. Williams is the author or co-author of over 250 journal and conference papers, as well as the book RFID
Technology and Applications. He contributed to the 2013 report for the United Kingdom's Government Office for
Science Foresight project, The Future of Manufacturing.

Alongside Bill Gates and Larry Ellison, he was named as one of the 50 most powerful people in computer networks.
He consults for organizations including Accenture, Schlumberger, SAP Research, Microsoft Research, Kajima Corp,
U.S. Lincoln Laboratory, Sandia National Laboratories, U.S. Intelligence Advanced Research Projects Activity,
Motorola, Phillip-Morris Inc., Ford Motor Company, ExxonMobil, Shell, Total, and ARAMCO.

Dr. Williams' international collaborations include HKUST and PolyU (Hong Kong), the University of Cambridge and
the Imperial College of Science and Technology (United Kingdom), Malaysia University of Science and Technology,
KACST (Saudi Arabia), and Masdar Institute of Science and Technology (Abu Dhabi).

He organized the first Cyber-Physical Security Conference in the United Kingdom (2011), and, along with
Dr. Sanchez, he runs the MIT Applied Cybersecurity Professional Education summer program. At MIT, he teaches
courses in Architecting Software Systems (MIT 1.125) and Engineering Computation and Data Science
(MIT 1.00/1.001).
Program Faculty

Abel Sanchez
Research Scientist and Executive Director, MIT Geospatial Data Center

Dr. Abel Sanchez holds a Ph.D. from MIT. He is the executive director of MIT’s Geospatial Data Center, architect
of the IoT global network, and architect of data analytics platforms for SAP, Ford, Johnson & Johnson, Accenture,
Shell, ExxonMobil, and Altria. In cybersecurity, Dr. Sanchez architected an impact analysis of large-scale cyber
attacks, designing Cyber Ranges for the Department of Defense.

In password security, Dr. Sanchez led the design of a password firewall (negative authentication) for the
Intelligence Advanced Research Projects Activity agency. In ML, addressing fraud detection, Dr. Sanchez designed
a situational awareness framework that exploits different perspectives of the same data and assigns risk scores
to entities for Accenture.

He led the design of a global data infrastructure simulator, modeling follow-the-sun engineering, to evaluate the
impact of competing architectures on the performance, availability, and reliability of the system for Ford Motor
Company. He is involved in developing e-learning software for Microsoft via their I-Campus Program and
establishing the Accenture Technology Academy, an online resource for over 200,000 employees.

He has 10 years of experience with learning management systems that have been deployed in America, Asia, and
Europe. He teaches MIT courses on cybersecurity, engineering computation, and data science, and he has produced
over 150 educational videos.
Career Preparation and Guidance
This program offers a wide array of career support and guidance to help you develop your career path. These services
are provided by Emeritus, our learning collaborator for this program, via the Emeritus Career Center (ECC). The primary
goal is to help you build the skills needed to prepare for your career; however, we do not guarantee job placement. Learn
more about all of the services and support available to you, including:

A SUPPORT TEAM YOU CAN RELY ON


Your support team includes program leaders and career coaches who will help you reach your learning goals and
guide you through your job search.

CAREER PREPARATION SERVICES THAT GET YOU NOTICED

Write noteworthy resumes and cover letters

Prepare for interviews

Create effective LinkedIn profiles

Craft your elevator pitch

Navigate your job search

Negotiate your salary


Emeritus Career Center

Comprehensive career services. One convenient location.

The Emeritus Career Center is your one-stop shop for streamlined access to career-related services. As a
participant in the program, you will gain access to the ECC and its related benefits for 12 months from the
program start date. Benefits include:

COACHING
Schedule appointments with a career coach, and share details about new jobs and job search outcomes with them.

EVENT ACCESS
Learn about upcoming career events, register to attend, and view previously recorded webinars.

DOCUMENT STORAGE
Store your resume and other application materials in one convenient location.

RESOURCE LIBRARY
Access our growing Resource Library anytime to find job search resources, resume checklists, and other helpful
information.

CAREER PROFILE
Create a profile for networking.

These services are provided by Emeritus, our learning collaborator for this program.

UNLOCK ADDITIONAL BENEFITS: SHARE YOUR RESUME


Upload your resume to the ECC for approval and take advantage of:

• A resume review and feedback from your career coach.


Financing Options
We want to make sure that the Professional Certificate in Data Engineering program is an affordable option for all.
This is why we offer multiple ways to pay for the program.

Loan Partners (For US Residents)


Climb Credit: Immediate repayment, interest-only repayment, and deferred payment options are available.

Visit the Climb Credit application portal

Fill in your basic details and proceed to the loan section of the application

Select ‘Emeritus/MIT xPRO’ under the Campus dropdown, ‘Professional Certificate in Data Engineering’ from the
Program dropdown, and enter your program start date

Choose your preferred repayment option and enter financial information

Agree to the disclosure and submit your application

Our program advisors will contact you for a confirmation on your loan application*

After confirmation, we will certify your loan. You will receive a welcome email with login instructions from
[email protected] within 3 business days

Sallie Mae: Fixed repayment, interest-only repayment, and deferred payment options are available.

Visit the Sallie Mae application portal

Fill in your basic details and proceed to the loan application page

Select ‘Undergraduate Students’ when prompted

Choose from fixed repayment, interest-only repayment, and deferred payment options, and submit
your application

Our program advisors will contact you for a confirmation on your loan application*

After confirmation, we will certify your loan. You will receive a welcome email with login instructions
from [email protected] within three business days
Flexible Payment Options (For All Countries)
Choose to make your payment in two, three, or six installments for greater flexibility.

Complete your application for the Professional Certificate in Data Engineering, and enroll in the program.

You can opt for any one of the financing options to cover up to the full cost of the program tuition. If you are considering
financing your program through one of our partners, the enrollment process can only be completed with the assistance of
your program advisor or by calling +1 315 640 4846.

*Due to processing time, the loan application should be submitted no later than four business days prior to the enrollment deadline.
Certificate
Get recognized! Upon successful completion of this program, MIT xPRO grants a certificate of completion to
participants and 36 CEUs. This program is graded as a pass or fail; participants must receive 75% to pass and
obtain the certificate of completion.

After successful completion of the program, your verified digital certificate will be emailed to you, at no
additional cost, with the name you used when registering for the program. All certificate images are for
illustrative purposes only and may be subject to change at the discretion of MIT.

[Sample certificate: "This is to certify that Your Name has successfully completed Professional Certificate in
Data Engineering. Awarded 36 Continuing Education Units (CEUs)." Signatories: Eric Grimson, Vice President for
Open Learning, Massachusetts Institute of Technology; John R. Williams, Professor of Information Engineering in
MIT Department of Civil and Environmental Engineering; Abel Sánchez, Research Scientist and Executive Director of
MIT’s Geospatial Data Center.]

About MIT xPRO


MIT xPRO’s online learning programs leverage vetted content from world-renowned experts to make learning accessible
anytime, anywhere. Designed using cutting-edge research in the neuroscience of learning, MIT xPRO programs are
application focused, helping professionals build their skills on the job. To explore the full catalog of MIT xPRO courses and
programs, visit: xpro.mit.edu.

About Emeritus
MIT xPRO is collaborating with online education provider Emeritus to deliver this online course through a dynamic,
interactive, digital learning platform. This course leverages MIT xPRO's thought leadership in engineering and management
practice developed over years of research, teaching, and practice.
You can schedule a call with a program advisor from Emeritus to learn more about this MIT xPRO program.
SCHEDULE A CALL

Refer your colleague and receive a benefit:
REFER NOW

You can apply to the program here.
APPLY

Connect with a program advisor:
Email: [email protected]
Phone: U.S.: +1 315 640 4846
U.K.: +44 1416 736416
Singapore: +65 3138 2327
