Roadmap
Note: This is how I got into the field of data science and machine learning, so this document
should not be taken as the standard way to approach the subject. Below is an overall view
of the steps towards becoming a data scientist/machine learning engineer. The ones written in blue
are more advanced and not critical.
I have also added a few points about each part in the subsequent sections, and you can find
the motivation behind some parts in the remarks section.
1 Roadmap
The basic steps to become a data scientist:
• Start with a basic understanding of the Python and R programming languages
• Data retrieval techniques: become familiar with databases (relational, non-relational
and graph databases) and their query languages (SQL, the query languages of NoSQL
databases such as MongoDB and Elasticsearch, Cypher for graph databases, and so on)
• Data models and data structures
• Learn as much maths as you can
• Basic machine learning algorithms (supervised and unsupervised algorithms, etc.)
• Dive into a real-world problem and learn how to translate a business problem into the
corresponding data science problem, and how to translate the findings back into business
insight
• Hands-on advanced programming with Python (implementation of different ML
algorithms, especially implementing different neural networks using PyTorch and/or
TensorFlow)
• Learn advanced mathematical techniques related to each specific problem
• Reinforcement Learning
Data Science 1
1.2 Databases
I am not an expert in this area, so this section is very elementary and covers only what you need to
know about databases and how to query them.
Learning some flavor of SQL is pretty straightforward. These are some flavors you could try: MySQL,
Microsoft SQL Server, PostgreSQL. The skeleton of all of them is the same, with a few differences.
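To give a feel for that shared skeleton, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. SQLite is not one of the flavors listed above; it is used here only because it needs no server. The table name `orders`, its columns and the helper `top_customers` are made up for illustration.

```python
import sqlite3

def top_customers(rows, min_total):
    """Load (customer, amount) rows into an in-memory table and return the
    customers whose summed amount is at least min_total, largest first."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table and column names, chosen for this example only.
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            HAVING SUM(amount) >= ?
            ORDER BY total DESC
            """,
            (min_total,),
        )
        return cursor.fetchall()
    finally:
        conn.close()

sample = [("alice", 30.0), ("bob", 5.0), ("alice", 20.0), ("carol", 12.5)]
print(top_customers(sample, 10.0))  # [('alice', 50.0), ('carol', 12.5)]
```

The SELECT / FROM / GROUP BY / HAVING / ORDER BY structure carries over to the other flavors; what tends to differ is data types, placeholder syntax and vendor-specific functions.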
For document-based databases like MongoDB, I strongly suggest that you go through the material
provided on the official website: MongoDB
The same holds for Elasticsearch. The best reference is the one provided by Elasticsearch
Graph databases are a little different. They store not only data points as entities but also
the relations between them. Neo4j is a good example of a database of
this kind (Neo4j query language)
Neo4j is not the only option; Oracle can also store data with this structure.
As far as I know, though, many of the advanced Python libraries for graph learning
problems (like NetworkX) are compatible with Neo4j.
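The idea of storing entities together with their relations can be sketched in plain Python. This is not Neo4j, Cypher or NetworkX; the class `TinyGraph` and its methods are hypothetical names invented for this illustration, and a real graph database adds indexing, persistence and a query language on top of the same idea.

```python
from collections import defaultdict

class TinyGraph:
    """Hypothetical in-memory graph: nodes carry properties and
    edges carry a relation type. Not the API of any real database."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # source id -> [(relation, target id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[source].append((relation, target))

    def neighbours(self, source, relation):
        """Targets reachable from source through one edge of the given type."""
        return [t for r, t in self.edges.get(source, []) if r == relation]

graph = TinyGraph()
graph.add_node("ann", kind="person")
graph.add_node("bob", kind="person")
graph.add_node("acme", kind="company")
graph.add_edge("ann", "KNOWS", "bob")
graph.add_edge("ann", "WORKS_AT", "acme")
print(graph.neighbours("ann", "KNOWS"))  # ['bob']
```

The point of the sketch is that the relation ("KNOWS", "WORKS_AT") is stored as data in its own right, so a question such as "who does ann know?" is a direct lookup rather than a join across tables.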
While learning the elements of data retrieval and getting familiar with data structures,
make sure that you understand the difference between structured and unstructured data, and
that you know how each database stores and indexes data.
Keep in mind that as a data scientist, you do not need to dive deep into these subjects. But as part
of data manipulation, you will have to know how to get your data in an efficient way.
The list I have presented above is by no means comprehensive, but these are the main
examples of databases that store data in different ways. If you know how they
work, it is pretty much guaranteed that you can work with any other database.
1.3 Mathematics
Almost all problems in the field of artificial intelligence and machine learning are mathematical
problems, so to enter the field you must have a good grasp of several mathematical
subjects. As a matter of fact, learn as much maths as you can. When you encounter
various AI/ML algorithms, it is usually crucial to know how they work and what mathematical
concepts are used to derive them. This not only helps you choose the best algorithms for attacking
a problem, but also equips you with ways to optimize those algorithms and even
gives you the ability to create new algorithms from the existing ones.
The main mathematical concepts you must be familiar with are listed below. The ones written
in red are more advanced and require more time and effort, but I have attached some references
to help you get started with them and understand the motivation behind them.
I have also attached the slides from courses I have taught before, which give an overview of the concepts.
• Probability theory (frequentist view and Bayesian view) and Bayesian inference. Knowing
a few concepts from statistical analysis, such as hypothesis testing, is also necessary
(slides)
• Linear Algebra (slides)
• Optimization methods (slides)
1.4 Machine Learning
1.5 Artificial Intelligence
• Parameter learning methods: a deeper understanding of loss functions and of how to
avoid overfitting and underfitting
♠ Unsupervised Learning
• Basic clustering algorithms such as K-means, self-organizing maps (SOMs) and others
• Embedding methods (such as word2vec, doc2vec, etc.)
• Advanced clustering/data visualisation methods such as t-SNE and UMAP
Again, the list is not comprehensive, but in my opinion every data scientist should know all of
them.
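As an illustration of the simplest item in the unsupervised list, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The function name `kmeans` and the choice to initialise the centroids with the first k points are my own simplifications; library implementations use smarter initialisation and vectorised arithmetic.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length tuples.
    Initial centroids are the first k points (a simplifying assumption)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments

centroids, labels = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(centroids, labels)
```

The two alternating steps (assign points, then recompute means) are the whole algorithm; the result depends on the initial centroids, which is why real implementations usually run several initialisations.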
1.6 Reinforcement Learning
Fig. 2: How an agent interacts with the environment and takes actions.
Reinforcement learning¹ (RL) is the branch of AI where the machine learns to solve
complicated problems (like winning a game, putting the best advertisement on a webpage or trading in
the stock market) solely on its own. There is no explicit learning instruction here; the
machine learns by interacting with the environment and observing the rewards it gets through
this interaction. In recent years, reinforcement learning has gained a lot of attention, as it can
solve a wide variety of problems. So I encourage you to have a look at the following course
for more information on the subject: Google's DeepMind course taught by David Silver,
Introduction to Reinforcement Learning.
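The interaction loop described above can be sketched with tabular Q-learning, one standard RL algorithm that I picked for the example; the text does not single out any particular method. The corridor environment, the function names and all parameter values below are hypothetical and chosen only to keep the sketch small.

```python
import random

def train_corridor_agent(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                         epsilon=0.2, max_steps=200, seed=0):
    """Tabular Q-learning on a made-up corridor environment.
    States are 0..n_states-1; action 0 moves left, action 1 moves right.
    Reaching the last state gives reward 1 and ends the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Agent: explore with probability epsilon (or on ties), else exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Environment: apply the action, return next state and reward.
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Learning: move the estimate toward reward + discounted future value.
            future = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
            if state == goal:
                break
    return q

def greedy_policy(q):
    """Best action per non-terminal state according to the learned values."""
    return [0 if left > right else 1 for left, right in q[:-1]]

print(greedy_policy(train_corridor_agent()))
```

Nobody tells the agent that "right" is the correct move. It only sees the reward at the end of the corridor, and the update rule propagates that information back to earlier states.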
¹ This is more advanced, but I strongly recommend that you go over the material. The theory behind
reinforcement learning is very beautiful and the applications are fun!