ML Step by Step
It took me ten months to leave that life behind and start feeling like I belonged to the exclusive
world of people who can tell their medians from their means, their x-bars from the neighborhood
pub, and who know how to teach machines what they need to learn.
The transformation process was not easy. It demanded hard work, lots of time and dedication, and
plenty of help along the way. It also involved hundreds of hours of “studying”
in different forms and an equal amount of time practicing and applying all that was being learnt.
In short, it wasn’t easy to transform from being data dumb to a data nerd, but I managed
to do so while going through a terribly busy work schedule as well as being a dad to a
one-year-old.
The point of this article is to help you if you are looking to make a similar transformation but do
not know where to start and how to proceed from one step to the next. If you are interested in
finding out, read on to get an idea about the topics you need to cover and also develop an
understanding of the level of expertise you need to build at each stage of the learning process.
Schlumberger-Private
There are plenty of great online and offline resources to help you master each of these steps, but
very often, the trouble for the uninitiated can be in figuring out where to start and where to
finish. I hope spending the next ten to fifteen minutes going through this article will help solve
that problem for you.
And finally, before proceeding any further, I would like to point out that I had a lot of help in
making this transformation. Right at the end of the article, I will reveal how I managed to
squeeze in so much learning and work in a matter of ten months. But that’s for later.
For now, I want to give you more details about the nine steps that I had to go through in my
transformation process.
Suggested topics:
What is Analytics?
What is Data Science?
What is Big Data?
What is Machine Learning?
What is Artificial Intelligence?
How are the above domains different from each other and related to each other?
How are all of the above domains being applied in the real world?
Exercise to show that you know:
Write a blog post telling readers how to answer these questions if asked in an interview
Step 2: Learn some Statistics
I have a confession to make. Even though I feel like a machine learning expert, I do not feel that
I have any level of expertise in statistics. That should be good news for people who struggle
with concepts in statistics as much as I do, as it proves that you can be a data scientist without
being a statistician. Having said that, you cannot ignore statistical concepts – not in machine
learning and data science!
So what you need to do is to understand certain concepts and know when they may be applied or
used. If you can also completely understand the theory behind these concepts, give yourself a
few good pats on your back.
Suggested topics:
Data structures, variables and summaries
Sampling
The basic principles of probability
Distributions of random variables
Inference for numerical and categorical data
Linear, multiple and logistic regression
Create a list of references with the easiest to understand explanation that you found for
each topic and publish them in a blog. Add a list of statistics related questions that one
may be expected to answer in a data science interview
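To make the first topic concrete, here is a minimal sketch of what "summaries" can look like in practice, using only Python's built-in statistics module. The scores are made-up numbers and the summarize helper is a name I chose for illustration. The point it tries to show is how a single extreme value pulls the mean far away from the median.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical scores, with one extreme value at the end
scores = [10, 12, 11, 13, 95]
summary = summarize(scores)
print(summary)
```

If you drop the last value from the list and run it again, the mean and the median land close together, which is a quick way to see why the median is often the safer summary for skewed data.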
Step 3: Learn Python or R (or both) for data analysis
Programming turned out to be easier to learn, more fun and more rewarding in terms of the
things it made possible, than I had ever imagined. While mastering a programming language
could be an eternal quest, at this stage, you need to get familiar with the process of learning a
language and that is not too difficult.
Both Python and R are very popular and mastering one can make it quite easy to learn the other.
I started with R and have slowly started using Python for doing similar tasks as well.
Suggested topics:
Extract a table from a website, modify it to compute new variables, and create graphs
summarizing the data
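As a rough idea of what this exercise involves, the sketch below parses an HTML table, computes a new variable and prints a crude text chart. It uses only the Python standard library. The table content, the TableParser class and the density column are all assumptions made for illustration; in a real exercise the HTML would come from an actual web page and you would probably use a dedicated data analysis library and a proper plotting tool.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical page content; in practice this would come from a web request
HTML = """
<table>
  <tr><th>city</th><th>population</th><th>area_km2</th></tr>
  <tr><td>Alpha</td><td>1000</td><td>10</td></tr>
  <tr><td>Beta</td><td>600</td><td>20</td></tr>
</table>
"""

parser = TableParser()
parser.feed(HTML)
header, *body = parser.rows

# Turn each row into a dict and compute a new variable: population density
records = [dict(zip(header, row)) for row in body]
for rec in records:
    rec["density"] = int(rec["population"]) / int(rec["area_km2"])

# A crude text "graph": one # per 10 people per square km
for rec in records:
    print(f"{rec['city']:<6} {'#' * int(rec['density'] // 10)}")
```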
What makes the innings even more remarkable is that the other 43 innings in that test match had
an average of only 10.8 runs an innings, with only about 40% of all batsmen registering a score
of ten or more runs. In fact, the second highest score by an Australian in the match was 20 runs.
Given that Australia won the match by 45 runs, we can say with conviction that Bannerman’s
innings was the most important contributor to Australia’s win.
Just like we were able to build this story from the scorecard of the test match, exploratory data
analysis is about studying data to understand the story that is hidden beneath it, and then sharing
the story with everyone.
Personally, I find this phase of a data project the most interesting, which is a good thing, as
exploratory data analysis can be expected to take up quite a lot of the time in a typical
project.
Topics to cover:
Project output:
Create a blog post summarizing the exercise and sharing the dashboard or story. Use a
dataset with at least ten columns and a few thousand records
That is where unsupervised machine learning algorithms come in. This is not the time to bore
you with details about what these are all about, but the good news is that once you reach this
stage, you have moved on into the world of machine learning and are already in elite company.
Topics to cover:
K-means clustering
Association rules
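To give a feel for the first of these topics, here is a minimal k-means sketch in plain Python. It is a teaching toy under stated assumptions, not how a production library does it: the points are invented, the kmeans function is my own, and it simply takes the first k points as starting centroids, whereas real implementations use smarter seeding.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means: returns (centroids, labels)."""
    # Assumption: the first k points serve as the starting centroids
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Hypothetical 2-D data: two visibly separate groups
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Nobody told the algorithm which group each point belongs to; it found the two groups from the distances alone, which is the whole idea behind "unsupervised".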
Milestone exercise:
Step 6: Create supervised learning models
If you had data about millions of loan applicants and their repayment history from the past, could
you identify an applicant who is likely to default on payments, even before the loan is approved?
Given enough prior data, could you predict which users are more likely to respond to a digital
advertising campaign? Could you identify if someone is more likely to develop a certain disease
later in their life based on their current lifestyle and habits?
Supervised learning algorithms help solve all these problems and a lot more. While there are a
plethora of algorithms to understand and master, just getting started with some of the most
popular ones will open up a world of new possibilities for you and the ways in which you can
make data useful for an organization.
Topics to cover:
Logistic regression
Classification trees
Ensemble models like Bagging and Random Forest
Support Vector Machines
You have not really started with creating models till you have done this:
Take a dataset, create models using all the algorithms you have learnt. Train, test and
tune each model to improve performance. Compare them to identify which is the best
model and document why you think it is so
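The train, test and compare loop can be sketched in a few lines of plain Python. The example below fits a one-feature logistic regression by gradient descent and compares it with a naive baseline on held-out data. Everything in it is an assumption for illustration: the debt-to-income numbers are invented, the function names are mine, and the learning rate and epoch count were picked by hand. In a real project you would use a tested library and far more data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=1000):
    """Fit w and b for P(y=1|x) = sigmoid(w*x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def accuracy(w, b, xs, ys):
    """Share of examples where the predicted class matches the label."""
    preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Hypothetical data: x = debt-to-income ratio, y = 1 if the applicant defaulted
train_x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [0.15, 0.35, 0.65, 0.85]
test_y = [0, 0, 1, 1]

w, b = train_logistic(train_x, train_y)
model_acc = accuracy(w, b, test_x, test_y)

# Baseline for comparison: always predict the most common training label
majority = max(set(train_y), key=train_y.count)
baseline_acc = sum(y == majority for y in test_y) / len(test_y)

print("model accuracy:", model_acc)
print("baseline accuracy:", baseline_acc)
```

Measuring both models on data they never saw during training is the important habit here; a model that only looks good on its own training data has not proved anything yet.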
Step 7: Understand Big Data Technologies
Many of the machine learning models in use today have been around for decades. The reason
these algorithms are only finding applications now is that we finally have access to
sufficiently large amounts of data to supply to them, so that they are able to
come up with useful outputs.
Data engineering and architecture is a field of specialization in itself, but every machine learning
expert must know how to deal with big data systems, irrespective of their specialization within
the industry.
Understanding how large amounts of data can be stored, accessed and processed efficiently is
important to being able to create solutions that can be implemented in practice and are not just
theoretical exercises.
I had approached this step with a real lack of conviction, but as I soon found out, that was driven
more by the fear of the unknown in the form of Linux interfaces than by any real complexity in
finding my way around a Hadoop system.
Topics to cover:
Upload data, run processes and extract results
after installing a local version of Hadoop or Spark on your system
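Before installing anything, it may help to see the idea these systems are built around. The sketch below is not Hadoop or Spark code; it is a plain-Python imitation of the map and reduce phases that such systems spread across many machines. The text chunks and the function names are assumptions for illustration.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Count words within one chunk of text (the work a single node might do)."""
    return Counter(chunk.lower().split())

def reduce_phase(left, right):
    """Merge two partial counts into one."""
    return left + right

# Hypothetical input, already split into chunks as a distributed store would do
chunks = [
    "big data is big",
    "data is everywhere",
]

partials = [map_phase(c) for c in chunks]
totals = reduce(reduce_phase, partials, Counter())
print(totals)
```

Because each chunk is counted independently, the map phase can run on as many machines as there are chunks, and only the small partial counts need to travel over the network to be merged.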
Step 8: Explore Deep Learning Models
Deep learning models are helping companies like Apple and Google create solutions like Siri or
the Google Assistant. They are helping global giants test driverless cars and suggesting best
courses of treatment to doctors.
Machines are able to see, listen, read, write and speak thanks to deep learning models that are
going to transform the world in many ways, including significantly changing the skills required
for people to be useful to organizations.
Getting started with creating a model that can tell the image of a flower from a fruit may not
immediately help you start building your own driverless car, but it will certainly help you start
seeing the path to getting there.
Topics to cover:
Milestone exercise:
Create a model that can correctly identify pictures of two of your friends or family
members
The internet presents glorious opportunities to find such projects. If you have been diligent about
the previous eight steps, chances are that you would already know how to find a project that will
excite you, be useful to someone, as well as help demonstrate your knowledge and skills.
Topics to cover:
Data collection, quality check, cleaning and preparation
Exploratory data analysis
Model creation and selection
Project report
Milestone exercise:
Get in touch with a stakeholder who will be interested in your report and share your
findings with them and get feedback
End Notes
Machine learning and artificial intelligence are a set of skills for the present and future. It is also a
field where learning will never cease and very often you may have to keep running to stay in the
same place, as far as being equipped with the most in-demand skills is concerned.
However, if you start the journey well, you will be able to understand how to go about taking the
next step in your learning path. As you must have gathered by now, starting the journey well is a
pretty challenging exercise in itself. If you choose to start upon it, I hope this article will have
been of some help to you and I wish you the very best.
Finally, I will confess that I got a lot of help with my ten-month transition. The reason I was able
to cover so much ground in this amount of time, along with a busy schedule at work and home,
was that I enrolled in the Post Graduate Program in Data Science and Machine Learning offered
by Jigsaw Academy and Graham School, University of Chicago.
Investing in the course helped in keeping my learning hours focused, created external pressure
that ensured that I was finding time for it irrespective of whatever else was going on in life, and
gave me access to experts in the form of faculty and a great peer group through other students.
Transforming from being non-technical to someone who is comfortable with the machine
learning world has already opened up many new doors for me. Whatever path you choose to
make this transformation, you can do so with the assurance that going through the rigor will reap
rewards for a long time and will banish any fears of becoming irrelevant in tomorrow’s
economy.