Roadmap

Note: This is how I got into the field of data science and machine learning, so this document
should not be considered a standard way to approach the subject. Here is an overall view
of the steps towards becoming a data scientist/machine learning engineer. The ones written in blue
are more advanced and not critical.

I have also added a few points about each part in the subsequent sections. You can also find
the motivation behind some parts in the remarks section.

1 Roadmap
The basic steps to become a data scientist:
• Start with a basic understanding of the Python and R programming languages
• Data retrieval techniques: become familiar with databases (relational, non-relational
and graph databases) and their query languages (SQL, the query languages of NoSQL
databases such as MongoDB and Elasticsearch, Cypher for graph databases, and so on)
• Data models and data structures
• Learn as much maths as you can
• Basic machine learning algorithms (supervised and unsupervised algorithms, ...)
• Dive into a real-world problem and learn how to translate a business problem into the
corresponding data science problem and translate the findings back into business
insight
• Hands-on advanced programming with Python (implementation of different ML algorithms,
especially implementing different neural networks using PyTorch and/or TensorFlow)
• Learn advanced mathematical techniques related to each specific problem
• Reinforcement learning

1.1 Programming with Python and R


Although Python is the main programming language for data analysis and machine learning,
R is often the better fit for statistical learning tasks. So you should learn how to use R
as well.
Both languages are fairly easy to learn. It should take you less than a month to get
familiar with the elements of both.
To learn Python and R, I suggest you look at one of the courses on Udacity. They usually have
good teachers, and the courses cover the most important material efficiently.
Whether you take a course like this or read a book to learn Python/R, make sure that you have
learned the following subjects:

• Basic programs with Python/R and the different operators
• The notion of classes (which is not as strict as in languages like Java or Scala)
• Data manipulation libraries like dplyr (R) and pandas (Python)
• Data visualization techniques and libraries that help you visualize data, like matplotlib,
plotly, seaborn, etc. This part is essential, especially if you want to be a good data analyst.
• Libraries that help you connect to different databases, like psycopg2, pymongo and
elasticsearch
• Working with matrices and arrays: the NumPy library
For the deep learning part, I have attached one of Udacity's introductory courses on deep
learning using PyTorch. The GitHub projects of this course can be found in this repository:
deep learning with pytorch. You will need this part when you get to advanced programming
for machine learning and AI, so do not go over it before learning the basics of
implementing different ML algorithms.
Keep in mind that you do not need to learn all of this completely; if
you miss some points, do not worry. You only have to know the pillars of these programming
languages. The actual learning comes as you grow through real projects. There you will use
these tools thousands of times, and you will eventually become comfortable with all of this material.

1.2 Databases
I am not an expert in this area, so this section is very elementary and covers only what you need to
know about databases and how to query them.
Learning some flavor of SQL is pretty straightforward. These are some flavors you could try: MySQL,
MSSQL, PostgreSQL. The skeleton of all of them is the same, with a few differences.
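
As a minimal sketch of that shared skeleton, the snippet below uses Python's built-in sqlite3 module (SQLite is chosen here only because it needs no server; it is not one of the flavors named above). The table names, column names and rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; all table/column names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Bob", "Paris")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 30.0), (2, 1, 20.0), (3, 2, 15.0)],
)

# SELECT / JOIN / GROUP BY / ORDER BY read almost the same in every SQL flavor.
query = """
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
print(totals)
conn.close()
```

The same query should carry over to MySQL or PostgreSQL with little or no change; what usually differs between flavors is data types, the connection library, and some built-in functions.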
For document-based databases like MongoDB, I strongly suggest that you go through the material
provided by the official website: MongoDB
The same holds for Elasticsearch. The best reference is the one provided by Elasticsearch itself.

∗∗∗
Graph databases are a little different: they store not only data points as entities, but also
the relations between them. Neo4j is a good example of a database of
this kind (Neo4j query language)
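
To illustrate the idea of storing entities and relations side by side, here is a toy sketch in plain Python. It is not how Neo4j works internally; the node names, labels and relation types are invented for the example.

```python
# A toy property graph: entities (nodes) and typed relations (edges).
# All names, labels and relation types here are hypothetical.
nodes = {
    "alice": {"label": "Person"},
    "bob": {"label": "Person"},
    "carol": {"label": "Person"},
    "acme": {"label": "Company"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("carol", "WORKS_AT", "acme"),
]


def neighbors(node, relation):
    """Return the nodes reachable from `node` through one edge of type `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


def two_hop(node, relation):
    """Return the nodes reachable through exactly two edges of type `relation`."""
    return sorted(
        {far for mid in neighbors(node, relation) for far in neighbors(mid, relation)}
    )


print(two_hop("alice", "KNOWS"))
```

In a graph query language such as Cypher, a traversal like `two_hop` is written as a pattern over relations rather than as a chain of joins, which is the main practical difference from a relational database.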

fig 1: How a graph database stores and retrieves data

Neo4j is not the only one you could work with; Oracle is also capable of storing data with the
desired structure. But as far as I know, many of the advanced Python libraries for graph learning
problems (like NetworkX) are compatible with Neo4j.
In the process of learning the elements of data retrieval and getting familiar with data structures,
make sure that you understand the difference between structured and unstructured data, and
that you know how each database stores and indexes data.
Keep in mind that as a data scientist you do not need to dive deep into these subjects. But as part
of data manipulation you will have to know how to get your data in an efficient way.
The list I have presented above is by no means comprehensive, but these are the main
examples of databases that store data in different ways. If you know how they
work, it is pretty much guaranteed that you can work with any other database.

1.3 Mathematics
Almost all problems in the field of artificial intelligence and machine learning are mathematical
problems, so to enter the field you must have a good grasp of different mathematical
subjects. As a matter of fact, learn as much maths as you can. Especially when you encounter
various AI/ML algorithms, it is usually crucial to know how they work and what mathematical
concepts are used to derive them. This not only helps you choose the best algorithm for attacking
a problem, but also equips you with ways to optimize the algorithms and even
gives you the ability to create new algorithms from existing ones.

The main mathematical concepts you must be familiar with are listed below. The ones written
in red are more advanced and require more time and effort, but I have attached some references
to help you get on board with them and understand the motivation behind them.
I have also attached slides I have taught from before, which give an overview of the concepts.
• Probability theory (frequentist view and Bayesian view) and Bayesian inference. Knowing
a few concepts from statistical analysis, like hypothesis testing, is also necessary
(slides)
• Linear algebra (slides)
• Optimization methods (slides)
• Stochastic processes and time series analysis
• Topology (very basic)
• Markov chain Monte Carlo sampling methods (MCMC)
• Probabilistic graphical models (PGMs)
• Control theory, dynamic programming and Bellman equations (for reinforcement learning)
A good reference for the basic concepts that are critical in machine learning
is Mathematics for Machine Learning. Another excellent one is Bishop's book, Pattern
Recognition and Machine Learning. I have attached both to the document I am sending.
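
To give a flavor of the frequentist and Bayesian views mentioned in the list above, here is a small sketch that estimates the bias of a coin from the same data in both ways. The data (7 heads, 3 tails) and the Beta(2, 2) prior are assumptions chosen only for illustration.

```python
def beta_posterior(alpha, beta, heads, tails):
    """Conjugate update: a Beta(alpha, beta) prior plus coin flips gives a Beta posterior."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


heads, tails = 7, 3  # made-up observations

# Frequentist point estimate: the maximum likelihood estimate of P(heads).
mle = heads / (heads + tails)

# Bayesian estimate: posterior mean under an assumed Beta(2, 2) prior.
posterior = beta_posterior(2, 2, heads, tails)
posterior_mean = beta_mean(*posterior)

print(mle, posterior, posterior_mean)
```

The posterior mean is pulled from the raw frequency towards the prior mean of 0.5, and the pull shrinks as more data arrives.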

1.4 Machine Learning


Machine learning is a pathway to artificial intelligence (AI). This subcategory of AI uses
algorithms to automatically learn insights and recognize patterns from data, applying that learning
to make increasingly better decisions. By studying and experimenting with machine learning,
programmers test the limits of how much they can improve the perception, cognition, and
action of a computer system. Deep learning, an advanced method of machine learning, goes a step
further. Deep learning models use large neural networks - networks loosely inspired by the human
brain - to learn complex patterns and make predictions with little
human input.
The basic machine learning algorithms are generally divided into two major categories:
supervised learning and unsupervised learning algorithms. In the former, we use labeled
data, and the machine learns how to label unseen data. The latter is in general harder: the
machine has to detect general patterns in the data without reference to any label or tag.
The main supervised and unsupervised learning algorithms you should learn are listed below:
♠ Supervised Learning
• Feature engineering
• Basic regression (linear/generalized linear)
• Tree-based classification/regression methods
• Support vector machines (SVMs)
• Kernel methods - These are not machine learning algorithms themselves, but they help you embed
your data into high-dimensional Hilbert spaces in order to get rid of the limits imposed by
a low-dimensional representation.
• Gradient-based methods (which, in combination with tree-based methods, give rise to
algorithms like XGBoost)
• Dimensionality reduction techniques, like PCA and MCA
• Nonlinear classifiers/regressors like multilayer neural networks (at this step, you do not
need to implement more advanced neural networks; they will come into play when you
are doing AI)

• Parameter learning methods - a deeper understanding of loss functions and of avoiding
overfitting and underfitting
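
To make basic regression, gradient-based learning and loss functions from the list above concrete, here is a minimal sketch that fits a line to labeled data by gradient descent on the mean squared error. The toy data (generated from y = 2x + 1), the learning rate and the number of steps are assumptions chosen for illustration.

```python
# Labeled toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mse(w, b):
    """Loss function: mean squared error of the line y = w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(lr=0.05, steps=2000):
    """Learn w and b by following the negative gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit()
print(w, b, mse(w, b))
```

In practice you would use a library such as scikit-learn for this, but writing the loop once by hand makes it clearer what "learning the parameters" means.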
♠ Unsupervised Learning
• Basic clustering algorithms like K-means, self-organizing maps (SOMs) and others
• Embedding methods (like word2vec, doc2vec, etc.)
• Advanced clustering/data visualisation methods like t-SNE and UMAP
Again the list is not comprehensive, but in my opinion, every data scientist must know all of
them.
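
As an example of detecting patterns without labels, K-means can be sketched in a few lines of plain Python. The 2D points and the starting centroids below are invented for the example, and a real project would normally rely on a library implementation instead.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means on 2D points; `centroids` holds the starting guesses."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl
            else old
            for cl, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Two visually separate groups of unlabeled points (made up for illustration).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping comes only from the distances between points, which is what separates this from the supervised algorithms above.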

1.5 Artificial Intelligence


Artificial Intelligence is the field of developing computers and robots that are capable of
behaving in ways that both mimic and go beyond human capabilities. AI-enabled programs can
analyze and contextualise data to provide information or automatically trigger actions without
human interference. Today, artificial intelligence is at the heart of many technologies we use,
including smart devices and voice assistants such as Siri on Apple devices. Companies are
incorporating techniques such as natural language processing and computer vision - the ability for
computers to use human language and interpret images - to automate tasks, accelerate decision
making, and enable customer conversations with chatbots.

The main topics you should become familiar with are:
• Elements of natural language processing (NLP) and text processing
• Different recurrent neural networks (naive RNN, LSTM, GRU, etc.). These are the
cornerstones of NLP. They are capable of performing lots of amazing tasks like translation,
auto-completing search text, and almost all the rudimentary tasks that a machine is
expected to do in the area of text mining
• Image processing (machine vision): the well-known convolutional neural networks (CNNs)
and their variants
• Attention networks and transformers (for both text and image processing)
• Implementation of PGMs and their applications. A good example is the Restricted Boltzmann
Machine (RBM), which is one of the most popular algorithms for developing recommender
systems
• If you are interested (or it is necessary for your job), you could learn about voice recognition
techniques.

1.6 Reinforcement Learning

fig 2: How an agent interacts with the environment and takes actions.

Reinforcement learning¹ (RL) is the branch of AI where the machine learns to solve
complicated problems (like winning a game, placing the best advertisement on a webpage, or trading
in the stock market) solely on its own. There is no explicit learning instruction here; the
machine learns by interacting with the environment and observing the rewards it gets through
this interaction. In recent years, reinforcement learning has gained lots of attention, as it can
solve a variety of different problems. So I encourage you to have a look at the following course
for more information on the subject: DeepMind's course taught by David Silver,
Introduction to Reinforcement Learning.
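
To show what learning from interaction and rewards can look like, here is a minimal sketch of tabular Q-learning, one standard RL algorithm, on an invented five-state corridor. The environment, the reward of 1 at the right end, and all hyperparameters are assumptions made for the example.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal and ends the episode
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right


def step(state, action):
    """Hypothetical environment: reward 1 on reaching the goal, otherwise 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values Q[state][action] purely from interaction."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice; ties are broken at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Bellman-style target: reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q = q_learning()
policy = ["left" if left > right else "right" for left, right in q[:-1]]
print(policy)
```

Nobody tells the agent that moving right is correct; it only sees the reward at the goal and propagates that value backwards through the update rule.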

1.7 Soft skills


To become a good data scientist, you should be able to communicate with almost all departments
in the workplace (either a company or a research facility). Depending on the company you are
planning to work in, you have to know what most departments do and what problems each of
them faces.

¹ This is more advanced, but I strongly recommend that you go over the material. The theory behind
reinforcement learning is very beautiful and the applications are fun!
