EXA Data Roadmap_ based on MIT Applied Data Science Program

The EXA Data Science Roadmap offers a free structured guide based on MIT’s Applied Data Science Certificate Program, aimed at individuals seeking careers in data science and machine learning. It covers foundational topics such as Python and statistics, advanced machine learning techniques, and includes a capstone project for practical application. The program is designed for various experience levels and does not require prior programming or math knowledge.

Uploaded by

kishor bhole
EXA Data Science Roadmap (Based on MIT’s Applied Data Science Certificate Program)


Master the concepts and skills from the MIT Applied Data Science
Certificate Program, entirely for FREE! Our curated guide offers a
structured path to Data careers.

Hi, I’m Jean


I'm the Founder and host of Exaltitude on YouTube. I’ve
worked in tech for the past 20 years as an engineer, an
engineering manager, and a team builder. I was the
19th engineer at WhatsApp and worked with Facebook
as an Engineering Manager for six years after the $19B
acquisition.

Throughout my career, I've mentored and coached countless Software Engineers
and Managers from diverse backgrounds, noticing common questions around
direction and growth: "Where am I headed, and how do I get there?" This
inspired me to share my insights, helping future engineers build purposeful,
successful careers.

Stay connected for updates, industry insights, and career advice on LinkedIn and
YouTube.

Have questions? Reach out anytime on my website!

www.exaltitude.io ● www.youtube.com/@exaltitude ● [email protected]


Expected Timeline
Most students complete the MIT certificate program in 12 weeks at 15 to 18
hours per week, but part-time self-study can take significantly longer.
The amount of time it takes depends on various factors, including:
●​ Your experience level: If you have a strong foundation in math, programming,
and related fields, you may be able to progress faster.
●​ Your dedication and time commitment: The more time you devote to studying,
the quicker you can complete the program.
●​ The depth of your learning: If you want to gain a deep understanding of each
topic, you may need to spend more time.
Set realistic expectations and be patient with yourself. Remember that the goal is to
learn and understand the material, not just to finish the program quickly.

Who is the program for?


●​ Individuals seeking a career transition into Data Science and Machine Learning
●​ Professionals aiming to advance their Data Science and ML leadership skills
●​ Entrepreneurs looking to leverage Data Science and ML for innovative solutions

Prerequisites

No prior programming or math experience is required. We'll start from the basics and
guide you through the entire journey, from understanding data to building complex
machine-learning models.

Study Guide

Module 1: Foundations
The first module of the applied Data Science program covers the foundations:
Python and Statistics.

Part 1: Python

● Python is a versatile programming language used for many applications, from
web development to data science and machine learning.

○ Must-learn topics: arrays and matrices

■ An array is a data structure that stores multiple elements at
contiguous memory locations.

■ A matrix is a two-dimensional (2D) array whose elements are
stored in rows and columns.

○​ Free Classes: Jean’s Python Roadmap on YouTube


○​ Recommended Book: Automate the Boring Stuff - Chapter 4
● Pandas is a widely used Python library for analyzing and
manipulating data.

○ Free Classes/Resources: Pandas Tutorial


● NumPy is a Python package for scientific computing that is built
around arrays.

○​ Classes/Resources: Stanford’s Numpy Tutorial
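
As a small illustration of the array and matrix definitions above, here is a minimal NumPy sketch. The variable names and the numbers are made up for the example.

```python
import numpy as np

# A one-dimensional array: four exam scores stored in one block of memory.
scores = np.array([72, 85, 90, 66])

# A matrix is a 2D array: 2 rows (students) by 3 columns (exams).
grades = np.array([[72, 85, 90],
                   [66, 78, 81]])

print(scores.shape)         # (4,)
print(grades.shape)         # (2, 3)
print(grades[1, 2])         # row 1, column 2 -> 81
print(grades.mean(axis=0))  # average of each column (each exam)
print(grades.T.shape)       # transposing swaps rows and columns -> (3, 2)
```

Indexing starts at 0, so `grades[1, 2]` is the second student's third exam.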

Part 2: Probability and Statistics

● Descriptive statistics describe and summarize the main features of a
data set. For example, the data set could be the population of a
neighborhood or the marks achieved by a sample of 100 students.

● A distribution is a function that gives the possible values of a random
variable and how likely each value is.

● Bayes' Theorem is a mathematical formula named after Thomas Bayes. It
lets you compute a conditional probability: the probability of one event
given that another event has occurred.

● Inferential statistics uses data from a sample to estimate properties of
a population and to test hypotheses. You will practice these ideas in Python.

●​ Free Classes: Khan Academy Probability and Statistics


●​ Recommended Book: A First Course in Probability, by Sheldon Ross, Pearson
(paid resource)
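
To make the statistics topics above concrete, here is a minimal sketch using only Python's standard library. The marks and the medical-test numbers are hypothetical, and `bayes_posterior` is an illustrative helper, not part of any library.

```python
import statistics

# Descriptive statistics: summarize a small sample of exam marks.
marks = [55, 62, 70, 70, 78, 85, 91]
print(statistics.mean(marks))    # arithmetic mean -> 73
print(statistics.median(marks))  # middle value -> 70
print(statistics.mode(marks))    # most frequent value -> 70


def bayes_posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B).

    prior               = P(A), e.g. share of people with a condition
    likelihood          = P(B | A), e.g. test is positive given the condition
    false_positive_rate = P(B | not A)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # about 0.161
```

Even with a fairly accurate test, the chance of having the condition after one positive result is only about 16% here, because the condition is rare to begin with.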

Module 2: Data Analysis and Visualization


This module includes the essential topics on data analysis and visualization.
●​ Visualization is the process of representing data and information in a graphical
form.
○​ Classes:
■​ Data Visualization with Python on Coursera
■​ Introduction to Tableau by DataCamp (paid)
● Exploratory Data Analysis (EDA) helps you uncover patterns and insights
in a data set, often using visual methods.

○​ Classes:
■​ Data Analysis with Python on Coursera
■​ Associate Data Analyst in SQL by DataCamp (paid)
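
A first pass at EDA can be sketched in a few lines of NumPy. The data below is a hypothetical sample, and the text histogram stands in for a real chart that you would normally draw with a plotting library.

```python
import numpy as np

# Hypothetical sample: hours studied and exam score for eight students.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 74, 79, 85])

# Summary statistics for one variable.
summary = {
    "min": int(score.min()),
    "max": int(score.max()),
    "mean": float(score.mean()),
    "std": float(score.std(ddof=1)),  # sample standard deviation
}
print(summary)

# Pearson correlation between the two variables (close to 1 for this data).
corr = float(np.corrcoef(hours, score)[0, 1])
print(round(corr, 3))

# A text histogram: how many scores fall into each of three equal-width bins.
counts, edges = np.histogram(score, bins=3)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * int(count)}")
```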

💡 Advanced Topics: Modules 3-5
The next few modules cover advanced topics. While they're not essential for every data
science role, they can be valuable for those who want to specialize in machine learning.
If you're aiming to become a machine learning engineer or data scientist who builds
complex models, then these modules are definitely for you.

If you're more interested in roles like data analyst or data engineer, you might not need
to dive deep into these topics. To learn more about different data roles and their
requirements, check out my newsletter, “Demystifying Data Careers: Your Guide to Data
Analyst vs Scientist vs Data Engineer vs ML Engineer,” where I've outlined the various
career paths in data science.

Module 3: Machine Learning


●​ Introduction to Unsupervised Learning
○ Unsupervised learning is a technique for finding structure in
unlabeled data sets, for example by clustering them.
○ Clustering groups data points so that points in the same group are more
similar to each other than to points in other groups.
○ Networks: Learn how data can be represented as a network, how a network
can represent dependence among variables, how to determine the important
nodes and edges in a network, and how to cluster within a network.
○​ Classes/Resources: Unsupervised Machine Learning on Coursera
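
Clustering is easiest to understand by example. The sketch below is a bare-bones k-means in NumPy, one common clustering algorithm; the guide itself does not name a specific one. The data points are made up, and using the first `k` points as starting centroids is a simplifying assumption (real implementations choose starting points more carefully).

```python
import numpy as np


def kmeans(points, k, n_iter=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    centroids = points[:k].astype(float)  # naive start: the first k points
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Two visibly separate groups of 2D points, with no labels attached.
points = np.array([[1.0, 1.0], [9.0, 9.0], [1.5, 2.0],
                   [8.5, 9.5], [1.0, 0.5], [9.5, 8.0]])
labels, centroids = kmeans(points, k=2)
print(labels)     # cluster index assigned to each point
print(centroids)  # center of each cluster
```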

●​ Introduction to Supervised Learning


○ Supervised learning is a technique that trains a model on labeled
data sets so that it can predict the label or value of new data.
○ Regression is a statistical technique in machine learning that models
the relationship between a dependent variable and one or more
independent variables.
○ Classification, as the name implies, assigns each item in a data set to
one of several categories. It can be performed on both structured
and unstructured data.
● A Decision Tree is a popular supervised machine learning algorithm
used for both classification and regression problems. It is a
hierarchical structure in which the internal nodes denote the dataset
features, branches indicate the decision rules, and each leaf node
represents the result.
● Random Forest is another popular supervised machine learning algorithm.
As the name implies, it builds multiple decision trees on various
subsets of a given dataset, then combines their predictions (averaging
for regression, majority vote for classification) to improve
predictive accuracy.
● Time Series: Time-series analysis consists of methods for analyzing
time-ordered data to extract meaningful statistics and other information.
Time-series forecasting predicts future values from previously
observed values.
○​ Classes/Resources: Supervised Machine Learning: Regression and
Classification on Coursera
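
As a minimal illustration of regression, the sketch below fits a straight line with ordinary least squares in NumPy. The house sizes and prices are hypothetical and were chosen to be exactly linear so the result is easy to check by hand; real data is noisy.

```python
import numpy as np

# Hypothetical labeled data: house size (square meters) and price (thousands).
size = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
price = np.array([150.0, 180.0, 240.0, 300.0, 360.0])  # exactly 3 * size

# Design matrix with a column of ones so the model can learn an intercept.
X = np.column_stack([np.ones_like(size), size])

# Least squares finds the intercept and slope that minimize squared error.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

# Use the fitted line to predict the price of an unseen 90 square meter house.
predicted = intercept + slope * 90.0
print(round(float(slope), 3), round(float(predicted), 1))
```

Here the independent variable is the size and the dependent variable is the price; the fitted slope says how much the price changes per extra square meter.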

Module 4: Deep Learning


Deep Learning is a subfield of Machine Learning, which is itself part of
Artificial Intelligence.
● Neural networks, inspired by the human brain, are used to extract
deep, high-level information from raw input such as images. This
chapter introduces you to artificial neural networks in deep learning.
●​ Convolutional Neural Networks (CNN) are used for image processing,
segmentation, classification, and several other applications. This chapter helps
you learn all the essential concepts about CNN.

●​ Transformers are a recent, very successful neural network architecture that
applies to language, graphs, and images. You will learn the basics of this
architecture and see how it can be applied to different types of data.
●​ Classes/Resources:
○ Practical Deep Learning for Coders on fast.ai
○​ Deep Learning in Python by DataCamp (paid)
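
The core operation inside a CNN layer can be sketched without any deep learning library. The example below slides a small kernel over an image and sums the element-wise products at each position. The image and kernel are hypothetical, and `conv2d` is an illustrative helper; frameworks provide much faster versions.

```python
import numpy as np


def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the
    element-wise products at each position. Deep learning libraries call
    this operation a convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# 4x4 image: dark left half (0) and bright right half (1).
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)

# Kernel that responds where brightness increases from left to right.
kernel = np.array([[-1.0, 1.0]])

edges = conv2d(image, kernel)
print(edges)  # each row is [0, 1, 0]: the edge sits in the middle
```

In a real CNN the kernel values are not hand-written like this; they are learned from data during training.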

Module 5: Recommendation Systems


● Recommendation systems predict a user's future preference for
products and then recommend the best-suited items to customers.
●​ Matrix factorization is a technique used in recommendation systems to predict
user preferences by decomposing a large user-item rating matrix into smaller
matrices (course 4).
●​ A tensor is a multidimensional array used to represent data with multiple
dimensions (course 3).
●​ Nearest Neighbor Collaborative Filtering is used to recommend items based on
the preferences of similar users (course 2).
●​ Classes/Resources:
○​ Recommender Systems Specialization on Coursera
○​ Building Recommendation Engines in Python by DataCamp (paid)
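
Nearest-neighbor collaborative filtering can be sketched on a tiny rating matrix. Everything below is hypothetical: the ratings, the use of 0 for "not rated", and the choice of cosine similarity computed over the raw rows (which treats missing ratings as zeros, a simplification).

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],   # user 0 has not rated item 2
    [5, 5, 4, 1],   # user 1 has tastes similar to user 0
    [1, 1, 2, 5],   # user 2 has roughly opposite tastes
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num, den = 0.0, 0.0
    for other in range(len(ratings)):
        if other == user or ratings[other, item] == 0:
            continue  # skip the user themselves and users without a rating
        sim = cosine(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den else 0.0


pred = predict(ratings, user=0, item=2)
print(round(pred, 2))  # pulled toward user 1's rating of 4
```

Because user 1 is more similar to user 0 than user 2 is, user 1's rating counts for more in the prediction.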

Module 6: Capstone Project

Examples of Capstone Projects from MIT


●​ Banking, Financial Services, and Insurance (BFSI) - Loan Default Prediction:
○​ Build a classification model to predict clients who are likely to default on
their loans. Give recommendations to the bank on important features to
consider while approving a loan.

○​ Concepts Used: Logistic Regression, Decision Trees, Random Forests, and
Ensemble Methods
●​ Research - Facial Emotion Detection:
○​ Use Deep Learning and AI techniques to create a Computer Vision model
that can accurately detect facial emotions. The model should be able to
perform multi-class classification on images of facial expressions and
categorize them according to the associated emotion.
○ Concepts Used: Artificial Neural Networks, Convolutional Neural Networks,
Computer Vision, Transfer Learning, and CNN Regularization
●​ Healthcare - Malaria Detection:
○​ Detect whether Red Blood Cells (RBCs) are infected with malaria using
Image Classification Techniques
○​ Concepts Used: Image Classification and Convolutional Neural Networks
●​ Real Estate - AI-Powered Boston House Price Prediction:
○​ Predicting house prices in the Boston metropolitan area based on features
of the property and its locality using Regression techniques
○​ Concepts Used: Linear Regression, Logistic Regression and K-Nearest
Neighbors
●​ Marketing - Predicting Potential Customers:
○​ Identifying leads who are more likely to convert to paid customers based
on their attributes and interaction details
○​ Concepts Used: Decision Trees and Random Forests
●​ Retail - Amazon AI Product Recommendation System
○​ Recommending the best Amazon products available to users based on
past rating data using AI-driven recommendation techniques
○​ Concepts Used: Rank-Based, Similarity-Based, Matrix Factorization-Based,
and Content-Based Recommendation Systems

Utilizing ChatGPT
ChatGPT can be a valuable tool for creating detailed project plans.

●​ Prompt:
○ Tell me how I can approach building this project in [insert project
description]. Give me practical hands-on resources and a step-by-step
guide.
●​ Sample Prompt:
○​ Tell me how I can approach building this project in Banking, Financial
Services, and Insurance (BFSI) - Loan Default Prediction: Build a
classification model to predict clients who are likely to default on their
loans. Give recommendations to the bank on important features to
consider while approving a loan. Give me practical hands-on resources
and a step-by-step guide.
●​ Sample Response:

Building a Loan Default Prediction Model: A Step-by-Step Guide


1. Data Collection and Preparation:

● Gather Relevant Data: Collect historical loan data, including
information on demographics (age, gender, occupation), credit
history (credit score, payment history), financial behavior (income,
expenses, debt), and loan details (loan amount, interest rate,
tenure).

● Data Cleaning: Handle missing values (imputation or removal),
outliers (capping or flooring), and inconsistencies in the data.

● Feature Engineering: Create new features that might be predictive
of default, such as:

○ Debt-to-income ratio
○ Credit utilization rate
○ Payment history (e.g., number of late payments, missed
payments)
○ Employment stability
○ Economic indicators (e.g., GDP growth, interest rates)

● Data Splitting: Divide the dataset into training and testing sets.
The training set will be used to train the model, and the testing set
will be used to evaluate its performance.
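
The feature engineering and data splitting steps above could look roughly like this in plain Python. The loan records, the field names, and both helper functions are hypothetical; with real data you would more likely use Pandas and a library splitter.

```python
import random

# Hypothetical loan records; the field names are illustrative only.
loans = [
    {"income": 4000, "monthly_debt": 1200, "late_payments": 0, "default": 0},
    {"income": 2500, "monthly_debt": 1500, "late_payments": 3, "default": 1},
    {"income": 6000, "monthly_debt": 900, "late_payments": 0, "default": 0},
    {"income": 3000, "monthly_debt": 2100, "late_payments": 5, "default": 1},
    {"income": 5200, "monthly_debt": 1300, "late_payments": 1, "default": 0},
    {"income": 4500, "monthly_debt": 1000, "late_payments": 0, "default": 0},
    {"income": 2800, "monthly_debt": 1700, "late_payments": 4, "default": 1},
    {"income": 7000, "monthly_debt": 1400, "late_payments": 0, "default": 0},
]


def add_features(record):
    """Return a copy of the record with a debt-to-income ratio added."""
    enriched = dict(record)
    enriched["dti"] = record["monthly_debt"] / record["income"]
    return enriched


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle reproducibly, then hold out a fraction of rows for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


features = [add_features(r) for r in loans]
train, test = train_test_split(features)
print(len(train), len(test))  # 6 training rows, 2 testing rows
```

Fixing the seed makes the split repeatable, so results can be compared between runs.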
2. Exploratory Data Analysis (EDA):

● Understand the Data: Explore the distribution of numerical
variables (e.g., histograms, box plots) and the frequency of
categorical variables (e.g., bar charts).

● Identify Relationships: Visualize the relationships between
different variables using scatter plots, correlation matrices, and
pair plots.

● Handle Imbalance: If the dataset is imbalanced (more non-default
cases than default cases), consider techniques like oversampling,
undersampling, or class weighting.
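
Class weighting, one of the imbalance techniques mentioned above, can be sketched as follows. The 90/10 label split is hypothetical. The formula shown, n_samples / (n_classes * class_count), is the "balanced" heuristic that scikit-learn also uses; other weighting schemes exist.

```python
from collections import Counter

# Hypothetical labels: 90 loans repaid (0) and 10 defaults (1).
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Rare classes get a larger weight so that mistakes on them cost more.
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(weights)  # the default class (1) is weighted 9 times more heavily
```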
3. Feature Selection:

● Identify Important Features: Use techniques like correlation
analysis, feature importance from tree-based models (e.g.,
decision trees, random forests), or statistical tests (e.g.,
chi-square test, ANOVA) to select the most relevant features.

● Reduce Dimensionality: If necessary, use techniques like Principal
Component Analysis (PCA) to reduce the number of features
while preserving most of the information.
4. Model Selection and Training:

● Choose a Suitable Model: Consider models like:

○ Logistic Regression: A simple yet effective model for
binary classification.
○ Decision Trees: Interpretable models that can handle both
numerical and categorical data.
○ Random Forest: Ensemble models that combine multiple
decision trees to improve accuracy.
○ XGBoost: A powerful gradient boosting algorithm that
often achieves high accuracy.
○ Neural Networks: Complex models that can capture
intricate patterns in the data.

● Train the Model: Use the training data to train the selected model.
● Hyperparameter Tuning: Optimize the model's performance by
tuning hyperparameters like learning rate, number of trees, and
maximum depth.
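
To show what "training" means for the simplest model in the list above, here is a from-scratch logistic regression trained by gradient descent in NumPy. The single feature (debt-to-income ratio), the data, and the learning rate and iteration count are all assumptions made for illustration. In practice you would use a library such as scikit-learn rather than writing this yourself.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic(X, y, lr=0.5, n_iter=5000):
    """Batch gradient descent on the average log loss."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad  # lr is the learning rate, a hyperparameter
    return w


def predict_proba(w, X):
    """Predicted probability of default for each row of X."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return sigmoid(Xb @ w)


# Hypothetical training data: debt-to-income ratio, label 1 = default.
X = np.array([[0.10], [0.15], [0.20], [0.25], [0.55], [0.60], [0.70], [0.80]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

w = train_logistic(X, y)
probs = predict_proba(w, X)
preds = (probs >= 0.5).astype(int)
print(preds)  # predicted class for each training example
```

The learning rate and the number of iterations are the hyperparameters here; changing them changes how quickly and how well the model fits.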
5. Model Evaluation:

● Evaluate Performance: Use metrics like accuracy, precision, recall,
F1-score, ROC curve, and confusion matrix to assess the model's
performance on the testing set.

● Identify Areas for Improvement: Analyze the model's predictions
to identify misclassified cases and potential areas for
improvement.
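
The evaluation metrics named above all derive from the four cells of the confusion matrix. The sketch below computes them by hand for a hypothetical set of ten predictions; `confusion_counts` is an illustrative helper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical true labels and model predictions for ten loans.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all predictions that are right
precision = tp / (tp + fp)           # of predicted defaults, how many were real
recall = tp / (tp + fn)              # of real defaults, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)                 # 3 1 1 5
print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```

For loan default, recall often matters more than accuracy, because a missed default (false negative) is usually the more costly mistake.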
6. Model Deployment and Monitoring:

● Deploy the Model: Integrate the model into the bank's credit
decisioning system to predict default probabilities for new loan
applications.

● Monitor Performance: Continuously monitor the model's
performance and retrain it as needed to adapt to changing
conditions.

● Explainable AI: Use techniques like SHAP or LIME to explain the
model's predictions and identify the most important factors
influencing the decision.
Hands-on Resources:
Online Courses:

● Coursera:
○ Machine Learning by Andrew Ng
○ Data Science Specialization by UC Irvine
● edX:
○ MicroMasters Program in Statistics and Data Science
● Kaggle: Explore datasets and notebooks related to loan default
prediction.
● YouTube:
○ StatQuest, Sentdex, 3Blue1Brown

Python Libraries:

● Pandas: Data manipulation and analysis
● NumPy: Numerical computations
● Scikit-learn: Machine learning algorithms
● Matplotlib and Seaborn: Data visualization
Additional Tips:

● Feature Engineering: Create informative features like
debt-to-income ratio, credit utilization ratio, and payment history.

● Handling Imbalance: Use techniques like oversampling,
undersampling, or class weighting.

● Model Selection: Experiment with different models and tune
hyperparameters.

● Model Evaluation: Use appropriate metrics like accuracy,
precision, recall, F1-score, and ROC curve.

● Explainable AI: Use techniques like SHAP or LIME to understand
the model's decisions.

● Continuous Monitoring: Monitor the model's performance and
retrain it as needed.

By following these steps and leveraging the provided resources, you can
build a robust loan default prediction model that can help banks make
informed decisions and reduce credit risk.

Tips for the project


Here are some steps to help you develop a plan:

● Identify your interests: What areas of AI and ML interest you the most? This will
help you narrow down your project topic.

● Choose a problem: Once you've identified your interests, start brainstorming
potential problems you can solve using AI and ML. Look for problems that are
challenging but achievable.

● Define your project's scope: Clearly define your project's goals and objectives.
What do you want to achieve? What are the key questions you want to answer?

● Gather data: Collect the data you'll need to train and evaluate your AI or ML
model. Using prepared datasets (such as those on Kaggle) is fine, but it's
important to explore and analyze the data yourself, including preprocessing
and error analysis, to fully understand the problem.

● Choose an algorithm or model: Select the appropriate AI or ML algorithm or
model for your project. Consider the nature of your data and the problem you're
trying to solve.

● Implement your model: Use a programming language like Python and libraries
like TensorFlow or PyTorch to implement your AI or ML model.

● Train and evaluate your model: Train it on your data and assess its performance
using appropriate metrics.

● Iterate and improve: If your model is not performing as well as you'd like, iterate
on your approach and make improvements.

● Present your results: Create a presentation or report summarizing your project,
the methods you used, and your results.

Ultimately, the best project for you will depend on your specific interests and career
goals. Consider your previous coursework, your strengths and weaknesses, and the
areas of AI that excite you the most.

Additional Tips:

●​ Consistency is vital: Dedicate a specific time each day for studying.


●​ Take breaks: Avoid burnout by taking short breaks.
●​ Join online communities: Connect with other learners for support and
collaboration.

●​ Build projects: Apply your knowledge by creating small projects.


●​ Stay motivated: Set achievable goals and celebrate your progress.

Remember, everyone learns at their own pace. Keep practicing, and you'll improve!

Good luck!
