MIT data engineering
PROFESSIONAL CERTIFICATE IN DATA ENGINEERING
Gain cutting-edge skills to advance your data engineering career.
The MIT xPRO Professional Certificate in Data Engineering is an immersive six-month program designed to provide you
with job-ready, in-demand data engineering skills and a competitive edge in the marketplace. Through an exploration of
core concepts, tools, techniques, and best practices, participants will learn data engineering essentials, from building
effective data architectures and data warehouses to designing data models, streamlining data processing, automating
data pipelines, data wrangling, and big data engineering. Participants also receive personalized feedback, live office hours
with learning facilitators, and the opportunity to develop a GitHub portfolio for potential employers.
MIT xPRO’s online learning programs showcase industry-aligned content from world-renowned experts to make learning
accessible anytime, anywhere, and solve challenges for developing technical professionals.
“Data engineers build the ‘nervous system’ of the company. Without it, the company cannot react to changes in
the external business environment or within the organization. They build the software and hardware systems
that power the company’s vision and are masters not just of software, but of hardware, networks, and analytic
apps that are changing everyday data.”
LEARNING COMMUNITY
Your learning community will provide an interactive environment where you can learn with a group of like-minded
individuals and build a global network of peers.
LEARNING FACILITATOR
Your learning facilitator will leverage their industry experience and expertise to guide you by holding live
sessions, providing assignment feedback, and answering questions.
PROGRAM ADVISOR
Your program advisor will be your enrollment resource, answering any pre-program questions and easing your
transition into the program.
CAREER COACH
Your career coach will help you successfully navigate your job search by assisting with goal setting, providing
feedback on your cover letter, résumé, and LinkedIn profile, and conducting mock interviews. They will be a source
of up-to-the-minute information on hiring trends and help celebrate the next step in your career.
Tools and Resources in the Program
The Professional Certificate in Data Engineering program employs the latest industry tools and resources.
“Data engineering really is a core component of today’s data infrastructure. And because organizations can’t
function without data, it’s also a career with a great deal of opportunity and incredibly interesting work
as well.”
– Abel Sánchez, Research Scientist and Executive Director,
MIT’s Geospatial Data Center
Who Is This Program For?
Develop and analyze databases using data science and data engineering tools and skills, including SQL and Python
• Use multiple Python and JavaScript libraries and tools, including NumPy, pandas, Dask, SciPy, TensorFlow,
Matplotlib, Seaborn, Scikit-learn, OpenSSL, Gym, Arrow, Feather, Lorem, Node-rdkafka, Graphviz, Data-Driven
Documents (D3), NotebookJS, NLTK, Paho, Express, and Kafka-Python
• Design databases conceptually and formally
• Perform extract, transform, and load (ETL) on a dataset
• Perform change data capture (CDC)
• Develop a web application in Java
• Connect a database to Debezium
• Create an application using web tokens
• Build a transit data application using Mapbox and Maven
• Use NiFi to create an ETL pipeline
• Utilize Hadoop to handle big data
• Use Docker to create and manipulate Spark images and containers
• Use PySpark to query data
• Create a workflow in Airflow
• Learn database containerization, how to use containers when working with databases, and how to
run queries to interface with a database container
• Learn data visualization
Implement artificial intelligence (AI)/machine learning (ML) algorithms, including those for reinforcement
learning and deep neural networks
• Learn the fundamental concepts of reinforcement learning, including the reward matrix, the quality
matrix, the Bellman equation, and deep neural networks
• Apply gradient descent to reduce error
• Implement Naïve Bayes and Gaussian Naïve Bayes classifiers and k-means clustering using Scikit-learn
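As a minimal sketch of how the reward matrix, the quality matrix, and the Bellman equation fit together, the following pure-Python example trains a tabular agent on a hypothetical four-state corridor. The environment, the discount factor of 0.8, the episode count, and the names `R`, `train`, and `greedy_path` are illustrative assumptions and are not taken from the program materials.

```python
import random

# Hypothetical corridor with states 0..3; state 3 is the goal.
# R[s][a] is the reward for moving from state s to state a.
# None marks a move that is not allowed.
R = [
    [None, 0, None, None],
    [0, None, 0, None],
    [None, 0, None, 100],
    [None, None, 0, 100],
]
GAMMA = 0.8  # discount factor (assumed value)
GOAL = 3


def valid_actions(reward, state):
    """Return the states reachable from `state`."""
    return [a for a in range(len(reward)) if reward[state][a] is not None]


def train(reward, gamma=GAMMA, episodes=1000, seed=0):
    """Fill the quality matrix Q by random exploration."""
    rng = random.Random(seed)
    n = len(reward)
    q = [[0.0] * n for _ in range(n)]
    for _ in range(episodes):
        state = rng.randrange(n)
        while True:
            action = rng.choice(valid_actions(reward, state))
            # Bellman update: Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a'),
            # where the next state s' equals the chosen action.
            best_next = max(q[action][a] for a in valid_actions(reward, action))
            q[state][action] = reward[state][action] + gamma * best_next
            state = action
            if state == GOAL:
                break
    return q


def greedy_path(q, reward, start, goal=GOAL, max_steps=10):
    """Follow the highest-quality action from `start` until the goal."""
    path = [start]
    state = start
    while state != goal and len(path) <= max_steps:
        state = max(valid_actions(reward, state), key=lambda a: q[state][a])
        path.append(state)
    return path


if __name__ == "__main__":
    q_matrix = train(R)
    print(greedy_path(q_matrix, R, start=0))
```

After training, following the largest entry in each row of the quality matrix should lead the agent from state 0 along the corridor to the goal state.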
Manage big data using data warehousing and workflow management platforms
• Run parallel operations in Dask
• Stream data through web sockets
• Identify the key concepts related to visualization, unstructured data, and JavaScript
• Create a sensemaking data pipeline
Build a user interface to view and interact with large amounts of live streaming data
• Discuss use cases for Mosquitto
• Stream live data to ThingsBoard
• Analyze live streaming data using ThingsBoard
• Construct a web server using Kafka
Create a GitHub portfolio to present the projects that you create to potential employers
Program Learning Objectives
• Explain key data science and data engineering concepts
• Develop and analyze databases using data science and data engineering tools and skills, including SQL and Python
• Manage big data using data warehousing and workflow management platforms
• Implement AI/ML algorithms, including those for reinforcement learning and deep neural networks
• Build a user interface to view and interact with large amounts of live streaming data
Program Modules
1. Introduction to Python
2. Python: Introduction to NumPy
3. Python: pandas
4. Databases: SQL
5. Databases: Basic SQL Statements
6. Database Analysis and the Client–Server Interface
7. A Model to Predict Housing Prices
8. ETL, Analysis, and Visualization
9. GitHub and Advanced Python Functions
10. Software Engineering Basics
11. Basics of Client–Server Architecture
12. Types of Databases and Database Containerization
13. CDC
14. Java and Debezium
15. Using Advanced Python Programming to Create Web Applications
16. Transit Data and Application Programming Interfaces (APIs)
17. Performing ETL Using NiFi
18. Platforms for Handling Big Data
19. Processing Big Data with Spark and Airflow
20. Introduction to ML and Advanced Probability
21. Introduction to Reinforcement Learning and Deep Neural Networks
22. Processing and Streaming Big Data
23. Creating a Data Pipeline
24. Handling Big Data with Mosquitto, ThingsBoard, and Kafka
Program Schedule
This program is organized into three main sections:
Section 1
In the first section of the program, you will learn the basics of the Python programming language, how
to work with relational databases using SQL, and how to work with Python to create databases and
server pipelines.
Modules 1–3: Python, NumPy, Matplotlib, and pandas
• You will work with multiple cutting-edge Python libraries including NumPy, Matplotlib, and pandas.
• You will use basic data types and advanced structures in Python, such as lists, tuples, sets,
and dictionaries.
Modules 4–6: Relational Databases and SQL
• You will write complex database queries, use regular expressions, clean a database, define drivers to
read a table, write files to your database, and write YAML files.
Modules 7 and 8: Portfolio Projects
• You will build a predictive model using linear regression.
• You will use ETL to analyze a dataset and then visualize the results using Matplotlib.
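The extract, transform, and load steps mentioned above can be sketched with only the Python standard library. This is an illustrative example under stated assumptions: the sample housing rows, the `homes` table, and the function names are hypothetical and do not come from the program's datasets.

```python
import csv
import io
import sqlite3

# Hypothetical raw data with inconsistent casing, stray spaces, and a missing price.
RAW_CSV = """city,price,sqft
Boston,550000,1100
boston ,610000,1250
Cambridge,,900
Cambridge,720000,1000
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, normalize city names, cast numbers."""
    cleaned = []
    for row in rows:
        if not row["price"] or not row["sqft"]:
            continue  # skip rows with missing values
        city = row["city"].strip().title()
        cleaned.append((city, int(row["price"]), int(row["sqft"])))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS homes (city TEXT, price INTEGER, sqft INTEGER)"
    )
    conn.executemany("INSERT INTO homes VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text):
    """Run all three steps against an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


if __name__ == "__main__":
    connection = run_pipeline(RAW_CSV)
    query = "SELECT city, AVG(price) FROM homes GROUP BY city ORDER BY city"
    for city, avg_price in connection.execute(query):
        print(city, avg_price)
```

Keeping each step in its own function makes it straightforward to swap the in-memory SQLite database for another target, or to chart the query results with a plotting library such as Matplotlib.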
Section 2
In the second section of the program, you will learn more advanced Python functions and create a
GitHub portfolio to present your projects to potential employers. Then, you will dive deeper into various
command line and data security tools. You will work on tasks such as database containerization, CDC,
and data wrangling.
Modules 9–11: GitHub, Docker, Visual Studio Code, and Flask
• You will learn more advanced Python constructs, such as classes, wrappers, and decorators. You will
work with a number of software engineering tools, including Postman, Docker, Flask, Bootstrap,
cookies, and security tools. You will practice using command line commands, asynchronous
event-driven programming, HTTP structure, and creating APIs.
Modules 12–14: Database Containerization, CDC, Java, and Debezium
• You will learn how to use containers when working with databases and how to run queries to
interface with a database container. You will manipulate data and perform CDC in different types
of databases including MongoDB, Cassandra, Redis, and Firebase. You will practice the basics of the
Java programming language and use Debezium to perform CDC on containers.
Modules 15 and 16: Portfolio Projects
• You will create a web application and use JavaScript Object Notation (JSON) web tokens,
authentication, and authorization to create security features. You will also manipulate a database
using Python redundant dictionaries.
• You will use Mapbox and Maven to build a transit data application.
Section 3
In the third section of the program, you will explore the tools that are used to manage big data and data
warehousing. You will learn how ML, reinforcement learning, advanced probability, and deep neural networks
are integrated into data engineering.
Modules 17–19
• You will use NiFi to construct an ETL pipeline and work with Hadoop, Spark, and Airflow to create data
pipelines for big data processing. You will use PySpark to query big data.
Modules 20–22
• You will practice foundational ML mathematical algorithms and apply them using the Python Scikit-learn
library. You will stream big data using pandas with the Parquet and Feather file formats. You will use the
Dask library to create, read, write, and analyze multiple files in parallel and simulate parallel
processing across distributed machines.
Modules 23 and 24: Portfolio Projects
• You will clean data from web pages and use JavaScript, the Document Object Model (DOM), and HyperText
Markup Language (HTML) to create a sensemaking data pipeline. You will visualize your data with the
JavaScript D3 library.
• You will use the MQTT protocol to produce temperature and humidity data and publish data to
ThingsBoard. You will use Kafka to create a Python application that publishes vehicles’ location data
to a Kafka topic. Finally, you will use Node.js to construct a web server that acts as a consumer for
the messages received from the Kafka broker.
Note: Break weeks are included to cover project assignment work and prepare for upcoming modules.
Assignments and Portfolio Projects
Each module includes engaging assignments and culminates in at least one GitHub portfolio project that you’ll
complete based on what you have learned in that portion of the program.
Assignments
Coding Exercises
Coding exercises are integrated into various modules through simple activities using Jupyter Notebook. They
allow you to practice building composite skills to prepare you for the assignments and portfolio projects.
Portfolio Projects
• Build a reinforcement learning model for robot navigation (from scratch in Python)
• Run TensorFlow for a deep neural network model (Deep Dream in Colab)
• Stream load 100 million lines of data and create and write 20 files in parallel using Dask
You will receive personalized feedback from your program leaders on your GitHub repositories, giving you a
market-ready portfolio to share with potential employers.
Program Faculty
Comprehensive career services. One convenient location.
The Emeritus Career Center (ECC) is your one-stop shop for streamlined access to career-related services. As a
participant in the program, you will gain access to the ECC and its related benefits for 12 months from the
program start date. Benefits include:
COACHING
Schedule appointments with a career coach, and share details about new jobs and job search outcomes with them.
EVENT ACCESS
Learn about upcoming career events, register to attend, and view previously recorded webinars.
DOCUMENT STORAGE
Store your resume and other application materials in one convenient location.
RESOURCE LIBRARY
Access our growing Resource Library anytime to find job search resources, resume checklists, and other
helpful information.
CAREER PROFILE
Create a profile for networking.
These services are provided by Emeritus, our learning collaborator for this program.
1. Fill in your basic details and proceed to the loan section of the application.
2. Select ‘Emeritus/MIT xPRO’ under the Campus dropdown, ‘Professional Certificate in Data Engineering’ from the
Program dropdown, and enter your program start date.
3. Our program advisors will contact you for a confirmation on your loan application.*
4. After confirmation, we will certify your loan. You will receive a welcome email with login instructions from
[email protected] within three business days.
Sallie Mae
Fixed repayment, interest-only repayment, and deferred payment options are available.
1. Fill in your basic details and proceed to the loan application page.
2. Choose from fixed repayment, interest-only repayment, and deferred payment options, and submit
your application.
3. Our program advisors will contact you for a confirmation on your loan application.*
4. After confirmation, we will certify your loan. You will receive a welcome email with login instructions
from [email protected] within three business days.
Flexible Payment Options (For All Countries)
Choose to make your payment in two, three, or six installments for greater flexibility.
Complete your application for the Professional Certificate in Data Engineering, and enroll in the program.
You can opt for any one of the financing options to cover up to the full cost of the program tuition. If you are considering
financing your program through one of our partners, the enrollment process can only be completed with the assistance of
your program advisor or by calling +1 315 640 4846.
*Due to processing time, the loan application should be submitted no later than four business days prior to the enrollment deadline.
Certificate
Get recognized! Upon successful completion of this program, MIT xPRO grants a certificate of completion to
participants and 36 CEUs. This program is graded as a pass or fail; participants must receive 75% to pass and
obtain the certificate of completion.
After successful completion of the program, your verified digital certificate will be emailed to you, at no
additional cost, with the name you used when registering for the program. All certificate images are for
illustrative purposes.
[Sample certificate: This is to certify that (Your Name) has successfully completed Professional Certificate in
Data Engineering. Awarded 36 Continuing Education Units (CEUs). Date.]
Eric Grimson
Vice President for Open Learning, Massachusetts Institute of Technology
John R. Williams
Professor of Information Engineering in MIT Department of Civil and Environmental Engineering
Abel Sánchez
Research Scientist and Executive Director of MIT’s Geospatial Data Center
About Emeritus
MIT xPRO is collaborating with online education provider Emeritus to deliver this online course through a dynamic,
interactive, digital learning platform. This course leverages MIT xPRO's thought leadership in engineering and management
practice developed over years of research, teaching, and practice.
You can schedule a call with a program advisor from Emeritus to learn more about this MIT xPRO program.
Refer your colleague and receive a benefit:
You can apply to the program here.
Connect with a program advisor:
Email: [email protected]
Phone:
U.S.: +1 315 640 4846
U.K.: +44 1416 736416
Singapore: +65 3138 2327