How To Self Learn Data Science in 2022
Although I don't hold a degree in data science, I am truly passionate about this field and decided to
experiment with building my own curriculum to self learn data science in my spare time. I would like to share my
experience and hope to offer some insights if you want to take the same journey.
Project based learning is a good starting point for people who already have some technical background but also
want to explore the building blocks of data science. A typical data science / machine learning project follows
a lifecycle, from defining the objectives, data preprocessing, exploratory data analysis, feature engineering,
https://www.visual-design.net/post/how-to-self-learn-data-science 1/9
11/4/23, 11:15 PM How to Self Learn Data Science in 2022
model implementation to model evaluation. Each phase requires different skillsets, mainly statistics,
programming, SQL, data visualization, mathematics and business knowledge.
I highly recommend Kaggle as the platform to experiment with your data science projects. With plenty of
interesting datasets and a cloud based programming environment, you can easily get data sources, code and
notebooks from Kaggle for free. As a reader/writer on Medium, I also recommend using the platform to gain
data science knowledge from professionals and share your own projects, all in one place.
1. Project based learning is practical and gives us a sense of achievement that we are doing something real!
2. It highlights the rationale for learning each piece of content. This goal-oriented approach provides a
bird's-eye view of how each little piece works together to form the big picture.
3. It allows us to actively retrieve information as we are learning. “Active Recall” has been shown to
significantly enhance information retention, compared to conventional learning, which only requires
passively consuming knowledge.
Let's break down the project lifecycle into the following 5 steps and see how each step connects to various
knowledge domains.
translate a business requirement into a technical solution. It takes years of experience in the field to build up
this knowledge. I can only recommend some websites that broaden your exposure to business
domains, for example Harvard Business Review, HubSpot, Investopedia and TechCrunch. Additionally, I
recommend the book "Data Science for Business" for an integrated view of data science and business.
search for a model in the hypothesis space that best fits our observed data and allows us to make predictions
on unobserved data.
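As a minimal sketch of this idea, assume the hypothesis space is the set of straight lines y = w * x + b. The data points and variable names below are made up for illustration and do not come from any dataset mentioned in this article.

```python
import numpy as np

# Observed data (made-up values that roughly follow y = 2x + 1).
x_observed = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_observed = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Hypothesis space: all straight lines y = w * x + b.
# np.polyfit picks the line with the smallest squared error on the observed data.
w, b = np.polyfit(x_observed, y_observed, deg=1)

# Use the fitted model to predict for an input we have not observed.
x_new = 5.0
y_new = w * x_new + b
print(f"w={w:.2f}, b={b:.2f}, prediction at x={x_new}: {y_new:.2f}")
```

A richer hypothesis space (for example higher-degree polynomials) can fit the observed data more closely, but it does not necessarily predict unobserved data better.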
Useful Resource:
Skillset - SQL
SQL is a powerful language for communicating with and extracting data from structured databases.
Additionally, learning SQL helps you build a mental model for generating insights through
data querying techniques, such as grouping, filtering, sorting, and joining. You will also find similar logic
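The four querying techniques named above can be tried out without installing a database server. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table names, column names and rows are hypothetical and only serve the illustration.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana', 'US'), (2, 'Ben', 'CA'), (3, 'Cy', 'US');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0), (4, 3, 70.0);
""")

query = """
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id   -- joining
    WHERE c.country = 'US'                        -- filtering
    GROUP BY c.name                               -- grouping
    ORDER BY total_spent DESC                     -- sorting
"""
rows = conn.execute(query).fetchall()
for name, total_spent in rows:
    print(name, total_spent)
conn.close()
```

Reading the query clause by clause is a good way to practise the mental model: which rows are kept, how they are grouped, and in what order the result is returned.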
unavoidable if you use Python for data extraction. It transforms a database into a dataframe, a table-like format
that we are most familiar with. In the data preprocessing stage, we need to examine and address the
following data quality issues, and these can all be done using Pandas.
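The list of issues is not reproduced here, so the sketch below assumes three common ones: missing values, duplicated rows and numbers stored as text. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up raw data with typical quality problems.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 31, 47],
    "city": ["Paris", "Lyon", "Nice", "Nice", None],
    "income": ["52000", "48000", "61000", "61000", "75000"],  # numbers stored as text
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

clean = (
    df.drop_duplicates()                                   # remove repeated rows
      .assign(income=lambda d: d["income"].astype(float))  # fix the data type
)
clean["age"] = clean["age"].fillna(clean["age"].median())  # fill missing ages with the median
clean["city"] = clean["city"].fillna("unknown")            # label missing categories explicitly
print(clean)
```

Whether to drop, fill or flag a problematic value depends on the dataset, so the choices above are one reasonable option rather than a rule.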
Useful Resources:
correlation, covariance
distribution
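Correlation, covariance and distribution can each be inspected with a few Pandas calls. The sketch below uses a small made-up dataset; the column names are hypothetical.

```python
import pandas as pd

# Made-up data: study time and exam result for six students.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 68, 70, 79],
})

corr_matrix = df.corr()  # pairwise Pearson correlation, between -1 and 1
cov_matrix = df.cov()    # pairwise covariance, in the units of the columns
print(corr_matrix)
print(cov_matrix)

# Distribution of a single column: centre, spread and asymmetry.
print(df["exam_score"].describe())
print("skewness:", df["exam_score"].skew())
```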
Once we have a solid understanding of the dataset characteristics, the next step is to apply the most appropriate
feature engineering techniques accordingly. For instance, use log transformation for right-skewed data and clipping
methods to deal with outliers. Here I list some common feature engineering techniques:
categorical encoding
scaling
imputation
feature selection
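Most of the techniques above take only a line or two in Pandas and NumPy. The sketch below covers imputation, log transformation, clipping, scaling and categorical encoding on a made-up dataframe; the column names, the clipping bounds and the choice of min-max scaling are assumptions for illustration. Feature selection is left out because it depends on the model and the target.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000.0, 42_000.0, np.nan, 55_000.0, 1_200_000.0],  # right-skewed, one missing
    "age": [22, 35, 41, 29, 120],                                   # 120 looks like an outlier
    "plan": ["basic", "pro", "basic", "free", "pro"],
})

# Imputation: fill the missing income with the median.
df["income"] = df["income"].fillna(df["income"].median())

# Log transformation: compress the long right tail of income.
df["log_income"] = np.log1p(df["income"])

# Clipping: cap extreme ages at assumed bounds.
df["age_clipped"] = df["age"].clip(lower=18, upper=90)

# Scaling: min-max scaling to the range [0, 1].
age = df["age_clipped"]
df["age_scaled"] = (age - age.min()) / (age.max() - age.min())

# Categorical encoding: one-hot encode the plan column.
df = pd.get_dummies(df, columns=["plan"], prefix="plan")
print(df.head())
```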
Useful Resource:
bar chart
histogram
box plot
heatmap
scatter plot
...
If interested, feel free to check out my articles on EDA and data visualization:
4. Model Implementation
After all of the preparation so far, it’s finally time to dive deeper into machine learning algorithms.
be able to distinguish the best use case for each. Generally, machine learning algorithms are categorized into
supervised learning and unsupervised learning. Below are some of the most popular algorithms:
Supervised Learning:
Linear Regression
Logistic Regression
Neural Network
Decision Tree
K-Nearest Neighbour
Unsupervised Learning:
Clustering
PCA
Dimension Reduction
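To make the distinction concrete, the sketch below trains one supervised model (logistic regression, which learns from labels) and one unsupervised model (k-means clustering, which never sees the labels) with scikit-learn. The iris dataset, the 70/30 split and the parameter values are choices made for this illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Supervised learning: the model is fitted on features AND known labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("test accuracy:", test_accuracy)

# Unsupervised learning: the model only sees the features and groups similar rows.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("cluster of the first five rows:", cluster_ids[:5])
```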
I have created a notebook and code snippets for machine learning algorithms. If you are interested, check them out:
Code Snippet
Notebook
scikit-learn website
Skillset - Math
Many beginners, including me, may wonder why we need to learn math for data science. For a
beginner, math knowledge mainly helps in understanding the theory behind the algorithms.
Moving forward, when we no longer rely on built-in libraries for building machine learning models, it allows us to
develop and optimize advanced algorithms. Additionally, hyperparameter tuning involves more advanced math
when searching for the model that minimizes the cost function.
This is when more complicated math topics come into play:
calculus
linear algebra
optimization problem
gradient descent
searching algorithms
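Gradient descent ties several of these topics together: calculus gives the gradient, and the update loop is a simple optimization procedure. The sketch below minimizes a toy one-variable cost function; the function, the starting point and the learning rate are assumptions chosen so the result is easy to check by hand.

```python
def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the cost with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0              # starting guess
learning_rate = 0.1  # step size

for step in range(100):
    # Move against the gradient, i.e. downhill on the cost curve.
    w -= learning_rate * gradient(w)

print(f"w after 100 steps: {w:.4f}, cost: {cost(w):.8f}")
```

With a learning rate that is too large (here, anything above 1.0) the updates overshoot and diverge, which is one reason the step size is treated as a hyperparameter.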
Useful Resources:
5. Model Evaluation
Skillset - Statistics (Inferential Statistics)
Inferential Statistics is particularly useful when making model predictions and evaluating model performance. As
opposed to descriptive statistics, inferential statistics focuses on generalizing the pattern observed in the
sample data to a wider population. It provides evidence of which features have high importance in making
Confusion matrix
Accuracy
ROC / AUC
For regression problems, where the output is a continuous number, some common metrics are:
R Squared
Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Squared Error (MSE)
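Most of these metrics are short formulas, so computing them by hand once is a useful exercise. The sketch below uses NumPy on made-up labels and predictions; the variable names are hypothetical, and ROC / AUC is left out because it needs predicted probabilities rather than hard labels.

```python
import numpy as np

# Classification: made-up true labels and predicted labels.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

confusion = np.array([[tn, fp], [fn, tp]])  # rows: actual class, columns: predicted class
accuracy = (tp + tn) / len(y_true)
print(confusion)
print("accuracy:", accuracy)

# Regression: made-up true values and predictions.
y_reg_true = np.array([3.0, 5.0, 7.0, 9.0])
y_reg_pred = np.array([2.5, 5.0, 7.5, 10.0])

errors = y_reg_pred - y_reg_true
mse = float(np.mean(errors ** 2))     # Mean Squared Error
rmse = float(np.sqrt(mse))            # Root Mean Squared Error
mae = float(np.mean(np.abs(errors)))  # Mean Absolute Error

ss_res = float(np.sum(errors ** 2))
ss_tot = float(np.sum((y_reg_true - y_reg_true.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot     # R Squared
print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R^2={r_squared:.3f}")
```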
Useful Resources
Take-Home Message
This is a general guide that documents the learning journey I followed, so I hope it can help
beginners who are also passionate about data science and would like to invest their spare time exploring this
field. Most topics mentioned in the article are covered at a surface level, which allows you to choose a field to dig deeper
into based on your own preference. If you find it helpful and would like to read more articles like this, please support
me by signing up for Premium Membership.