11/4/23, 11:15 PM How to Self Learn Data Science in 2022

Jan 30, 2022 6 min read

How to Self Learn Data Science in 2022

Updated: Apr 17, 2022

A Project-Based Approach to Get Started in Data Science

Grab the cheatsheet from our infographics gallery.

Although I don't hold a degree in data science, I am truly passionate about this field, so I decided to
experiment with building my own curriculum to self-learn data science in my spare time. I would like to share my
experience and hope it offers some insights if you want to take the same journey.

Project-based learning is a good starting point for people who already have some technical background but also
want to explore the building blocks of data science. A typical data science / machine learning project comprises
a lifecycle - from defining the objectives, data preprocessing, exploratory data analysis, feature engineering,

https://www.visual-design.net/post/how-to-self-learn-data-science 1/9
model implementation to model evaluation. Each phase requires different skillsets, mainly statistics,
programming, SQL, data visualization, mathematics and business knowledge.

I highly recommend Kaggle as the platform to experiment with your data science projects. With plenty of
interesting datasets and a cloud-based programming environment, you can easily get data sources, code and
notebooks from Kaggle for free. As a reader/writer on Medium, I also recommend using the platform to gain
data science knowledge from professionals and share your own projects, all in the same place.

Why a Project-Based Approach?

1. It is practical and gives us a sense of achievement that we are doing something real!

2. It highlights the rationale for learning each piece of content. This goal-oriented approach provides a
bird's-eye view of how the little pieces work together to form the big picture.

3. It allows us to actively retrieve information as we learn. “Active Recall” has been shown to significantly
enhance information retention compared with conventional learning, which only requires
passively consuming knowledge.

Let's break down the project lifecycle into the following 5 steps and see how each step connects to various
knowledge domains.

1. Business Problem & Data Science Solution

The first step of a data science project is to identify the business problem and define the objectives of an
experiment design or model deployment.

Skillset - Business Knowledge

At this stage, technical skills are not needed yet, but business understanding is required to identify the problem
and define the objectives: first, understand the domain-specific terminology that appears in the dataset; then,
translate the business requirement into a technical solution. Building up this knowledge takes years of experience
in the field. I can only recommend some websites that increase your exposure to business
domains, for example Harvard Business Review, HubSpot, Investopedia and TechCrunch. Additionally, I
recommend the book "Data Science for Business" for an integrated view of data science and business.

Skillset - Statistics (Experiment Design)

After defining the problem, the next step is to frame it as a data science solution. This starts with knowledge of
experiment design, such as hypothesis testing, sampling, bias / variance, different types of errors, and
overfitting / underfitting.
In the article "An Interactive Guide to Hypothesis Testing in Python", I introduced various types of statistical
tests - t-test, ANOVA, chi-square test, etc.
Machine learning can loosely be viewed as a hypothesis testing process, where we need to
search for a model in the hypothesis space that best fits our observed data and allows us to make predictions
on unobserved data.
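As a rough illustration of hypothesis testing, the sketch below runs a two-sample permutation test using only the Python standard library. The permutation test is swapped in here for simplicity; it is not one of the tests named above. The sample values, function name and parameters are made up for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Null hypothesis: both groups come from the same distribution,
    so the group labels are interchangeable.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Share of shuffles at least as extreme as the observed difference
    # (the +1 terms keep the estimate away from exactly zero).
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical data: page load times (seconds) for two website variants.
variant_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
variant_b = [12.9, 13.1, 12.8, 13.4, 12.7, 13.0, 13.2, 12.6]

observed_diff, p_value = permutation_test(variant_a, variant_b)
print(f"observed difference in means: {observed_diff:.3f}")
print(f"p-value: {p_value:.4f}")
```

A small p-value means that a difference this large would rarely appear if the two variants behaved the same, which is the evidence used to reject the null hypothesis.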
Useful Resources:

Khan Academy: Study Design

A Gentle Introduction to Statistical Hypothesis Testing

Probability for Statistics and Machine Learning

2. Data Extraction & Data Preprocessing

The second step is to collect data from various sources and transform the raw data into a digestible format.

Skillset - SQL
SQL is a powerful language for communicating with and extracting data from structured databases.
Learning SQL also helps you build a mental model for generating insights through
data querying techniques, such as grouping, filtering, sorting, and joining. You will find similar logic
in other tools, such as Pandas and SAS.
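To make the four querying techniques concrete, here is a small sketch that runs one SQL query against an in-memory SQLite database through Python's built-in sqlite3 module. The tables, columns and values are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with two small hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO customers VALUES (1, 'Ana', 'CA'), (2, 'Ben', 'US'), (3, 'Chloe', 'CA');
    INSERT INTO orders VALUES
        (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0), (4, 3, 200.0), (5, 3, 5.0);
    """
)

query = """
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total_spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id   -- joining
    WHERE o.amount >= 10                        -- filtering
    GROUP BY c.name                             -- grouping
    ORDER BY total_spent DESC                   -- sorting
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total_spent in rows:
    print(name, n_orders, total_spent)
conn.close()
```

The same join, filter, group and sort steps map closely onto Pandas operations such as merge, boolean indexing, groupby and sort_values.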

Useful Resources:

“Get Started with SQL Joins”

Datacamp: SQL fundamentals

Dataquest: SQL Basics

Skillset - Python (Pandas)

It is essential to get comfortable with a programming language. Its simple syntax makes Python a relatively
easy language to start with. Here is a great video tutorial if you are new to Python: Python for Beginners -
Learn Python in 1 Hour.

Once you have a basic understanding, it is worth spending some time learning the Pandas library. Pandas is almost
unavoidable if you use Python for data extraction. It turns the extracted data into a DataFrame - the table-like
format that we are most familiar with. In the data preprocessing stage, you need to examine and address the
following data quality issues, all of which can be handled with Pandas:

address missing data

transform inconsistent data types

remove duplicated values
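The three data quality checks above could look roughly like this in Pandas. This is a minimal sketch on a made-up DataFrame; the column names and the choice to fill missing ages with the median are assumptions for illustration, not the only reasonable approach.

```python
import pandas as pd

# Hypothetical raw data containing the three issues listed above.
raw = pd.DataFrame(
    {
        "customer": ["Ana", "Ben", "Ben", "Chloe", "Dev"],
        # numbers stored as text, with one missing value
        "age": ["34", "41", "41", None, "29"],
        # dates stored as text
        "signup_date": ["2022-01-05", "2022-01-09", "2022-01-09",
                        "2022-02-11", "2022-03-02"],
    }
)

# Remove duplicated rows (the second "Ben" row is an exact copy).
clean = raw.drop_duplicates().copy()

# Transform inconsistent data types: text -> numeric and text -> datetime.
clean["age"] = pd.to_numeric(clean["age"])
clean["signup_date"] = pd.to_datetime(clean["signup_date"])

# Address missing data: fill the missing age with the median age.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean.dtypes)
print(clean)
```

Whether to fill, drop or flag missing values depends on the dataset; `dropna()` is the usual alternative when the affected rows can be discarded.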

Useful Resources:

Python Pandas Tutorial: A Complete Introduction for Beginners

W3schools: Pandas Tutorial

Python for Data Science for Dummies

3. Data Exploration & Feature Engineering

The third step is data exploration, also known as exploratory data analysis (EDA), which reveals hidden
characteristics and patterns in a dataset. It usually involves data visualization techniques and is followed by
feature engineering to transform data based on the results of exploration.

Skillset - Statistics (Descriptive Statistics)

Data exploration uses descriptive statistics to summarize the characteristics of the dataset:

mean, median, mode

standard deviation, skewness

correlation, covariance

distribution
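Most of the summary measures listed above can be computed with the Python standard library alone. The sketch below uses two made-up samples; the skewness, covariance and correlation helpers are written out by hand for illustration (Pandas and SciPy provide their own versions).

```python
from statistics import mean, median, mode, stdev

# Hypothetical sample: monthly spend and number of visits per customer.
spend = [20.0, 22.0, 22.0, 25.0, 27.0, 30.0, 95.0]
visits = [2, 3, 3, 4, 4, 5, 12]


def skewness(values):
    """Simple skewness estimate: average cubed z-score (using the sample stdev)."""
    m, s, n = mean(values), stdev(values), len(values)
    return sum(((v - m) / s) ** 3 for v in values) / n


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    return covariance(xs, ys) / (stdev(xs) * stdev(ys))


print("mean:", round(mean(spend), 2))
print("median:", median(spend))
print("mode:", mode(spend))
print("standard deviation:", round(stdev(spend), 2))
print("skewness:", round(skewness(spend), 2))
print("covariance:", round(covariance(spend, visits), 2))
print("correlation:", round(correlation(spend, visits), 2))
```

Here the mean sits well above the median and the skewness is positive, both signs of a right-skewed distribution caused by the single large value.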

With a solid understanding of the dataset's characteristics, you can apply the most appropriate feature
engineering techniques. For instance, use a log transformation for right-skewed data and clipping
methods to deal with outliers. Here are some common feature engineering techniques:

categorical encoding

scaling

imputation

feature selection
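A few of these techniques, together with the log transformation and clipping mentioned earlier, can be sketched in plain Python as follows. The feature values, the clipping threshold and the variable names are assumptions made for this example; in practice, libraries such as Pandas or scikit-learn are commonly used for the same steps.

```python
import math
from statistics import median

# Hypothetical raw features for five customers.
income = [32_000.0, 41_000.0, None, 58_000.0, 950_000.0]  # right-skewed, one missing
plan = ["basic", "pro", "basic", "free", "pro"]           # categorical

# Imputation: replace the missing value with the median of the observed ones.
observed = [v for v in income if v is not None]
fill_value = median(observed)
income_imputed = [fill_value if v is None else v for v in income]

# Clipping: cap extreme values at an upper bound (an arbitrary threshold here).
upper_bound = 100_000.0
income_clipped = [min(v, upper_bound) for v in income_imputed]

# Log transformation: compress the long right tail (log1p computes log(1 + x)).
income_log = [math.log1p(v) for v in income_imputed]

# Scaling: min-max scale the clipped values into the range [0, 1].
low, high = min(income_clipped), max(income_clipped)
income_scaled = [(v - low) / (high - low) for v in income_clipped]

# Categorical encoding: one-hot encode the plan column.
categories = sorted(set(plan))
plan_one_hot = [[int(p == c) for c in categories] for p in plan]

print(income_scaled)
print(categories, plan_one_hot)
```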

Useful Resources:

3 Common Techniques for Data Transformation

Fundamental Techniques of Feature Engineering for Machine Learning

Feature Selection and EDA in Machine Learning

Skillset - Data Visualization

Combining statistics and data visualization allows us to understand the data through appropriate visual
representations. Whether you prefer visualization packages such as seaborn or matplotlib in Python and
ggplot2 in R, or visualization tools like Tableau and Power BI, it's essential to understand the use cases of
different chart types:

bar chart

histogram

box plot

heatmap

scatter plot

...

If interested, feel free to check out my articles on EDA and data visualization:

Semi-Automated Exploratory Data Analysis (EDA) in Python

How to Choose the Most Appropriate Chart?

Dashboard Design Principle

4. Model Implementation
After all of the preparation so far, it is finally time to dive deeper into machine learning algorithms.

Skillset - Machine Learning

scikit-learn is a powerful Python library that allows beginners to get started in machine learning easily. It offers
plenty of built-in functions, and we can implement a model in several lines of code. Although it has already
done the hard work for us, it is still crucial to understand how the algorithms operate behind the scenes and
to be able to distinguish the best use case for each. Generally, machine learning algorithms are categorized into
supervised learning and unsupervised learning. Below are some of the most popular algorithms:
Supervised Learning:

Linear Regression

Logistic Regression

Neural Network

Decision Tree

Support Vector Machine

K-Nearest Neighbour

Unsupervised Learning:

Clustering

PCA

Dimension Reduction
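To show what "several lines of code" can look like, here is a minimal supervised learning sketch with scikit-learn. It assumes scikit-learn is installed; the built-in iris dataset, the 80/20 split and logistic regression are choices made for this example rather than recommendations from the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: flower measurements (X) and species labels (y).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Supervised learning: fit a logistic regression classifier on the training rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out rows.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Other scikit-learn estimators follow the same fit / predict pattern, so swapping in a decision tree or a support vector machine changes only the line that constructs the model.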

I have created a notebook and code snippets for machine learning algorithms. If you are interested, check them out:

Code Snippet

Notebook

Other Useful Resources:

scikit-learn website

Coursera: Machine Learning with Python

Skillset - Math
Many beginners, including me, may wonder why we need to learn math for data science. For a
beginner, math knowledge mainly helps in understanding the theory behind the algorithms.
Moving forward, when we no longer rely on built-in libraries to build machine learning models, it allows us to
develop and optimize advanced algorithms. Additionally, hyperparameter tuning involves advanced math
knowledge for finding the model that minimizes the cost function.
This is when more complicated math topics come into play:

calculus

linear algebra

optimization problem

gradient descent

searching algorithms
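Gradient descent ties several of these topics together: calculus supplies the derivatives, and the optimization loop uses them to shrink the cost function. The sketch below fits a straight line to made-up data by minimizing the mean squared error; the data, learning rate and iteration count are assumptions chosen for illustration.

```python
# Hypothetical data that roughly follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]


def mse(w, b):
    """Cost function: mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


w, b = 0.0, 0.0          # initial guess
learning_rate = 0.05     # step size (a hyperparameter)
initial_cost = mse(w, b)

for _ in range(2000):
    dw, db = gradient(w, b)
    w -= learning_rate * dw   # step against the gradient
    b -= learning_rate * db

final_cost = mse(w, b)
print(f"w = {w:.3f}, b = {b:.3f}, cost = {final_cost:.4f}")
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow, which is one reason tuning it is treated as a search problem.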

Useful Resources:

3Blue1Brown: Essence of Linear Algebra

3Blue1Brown: Essence of Calculus

3Blue1Brown: Gradient Descent

5. Model Evaluation
Skillset - Statistics (Inferential Statistics)
Inferential statistics is particularly useful when making model predictions and evaluating model performance. As
opposed to descriptive statistics, inferential statistics focuses on generalizing the patterns observed in the
sample data to a wider population. It provides evidence of which features are most important in making
inferences. It is also used to assess model performance based on evaluation metrics.

For example, for classification problems, where the output is a discrete category, some common metrics are:

Confusion matrix

Type 1 error / Type 2 error

Accuracy

ROC / AUC
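The classification metrics above can be computed by hand for a small binary example. The labels, scores and 0.5 threshold below are made up for illustration; in practice, sklearn.metrics offers ready-made functions for the same quantities.

```python
# Hypothetical ground-truth labels and model outputs (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.65, 0.3, 0.2, 0.6, 0.8, 0.1]   # predicted probabilities
y_pred = [int(s >= 0.5) for s in y_score]              # classify with a 0.5 threshold

# The four cells of the confusion matrix.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type 1 errors (false positives)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type 2 errors (false negatives)

accuracy = (tp + tn) / len(y_true)

# AUC: the chance that a random positive is scored above a random negative
# (ties count as half).
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"confusion matrix: TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"accuracy: {accuracy:.3f}, AUC: {auc:.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes how well the scores rank positives above negatives across all thresholds.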

For regression problems, where the output is a continuous number, some common metrics are:

R Squared

Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Squared Error (MSE)
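These regression metrics follow directly from their definitions. The sketch below computes them on a tiny set of made-up actual and predicted values using only the standard library.

```python
import math

# Hypothetical actual values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]

n = len(y_true)
errors = [p - t for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n    # Mean Squared Error
rmse = math.sqrt(mse)                    # Root Mean Squared Error
mae = sum(abs(e) for e in errors) / n    # Mean Absolute Error

# R squared: 1 minus the ratio of residual variation to total variation.
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R^2={r_squared:.3f}")
```

RMSE and MAE are in the same units as the target, while MSE penalizes large errors more heavily and R squared is unitless.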

Useful Resources:

Khan Academy: Statistics and Probability

Metrics to Evaluate your Machine Learning Algorithm

Take-Home Message

This is a general guide that documents the learning journey I followed, and I hope it helps
beginners who are also passionate about data science and would like to invest their spare time in exploring this
field. Most topics mentioned in the article are covered at surface level, which allows you to choose a field to dig
deeper into based on your own preference. If you find it helpful and would like to read more articles like this,
please support us by signing up for a Premium Membership.
