Data Science Guide
Steve Nouri
https://www.linkedin.com/in/stevenouri/
COPYRIGHT NOTICE
This book or parts thereof may not be reproduced in any form, stored in any retrieval
system, or transmitted in any form by any means (electronic, mechanical,
photocopying, recording, or otherwise) without prior written permission of the
publisher, except as provided by the United States of America copyright law.
TABLE OF CONTENTS
- - - * - - -
CH. 1 - LAUNCHING YOUR CAREER
1.1 - What do I need to know in order to become a data scientist? / How do I land a job as a
data scientist?
1.2 - What are the most relevant tools to learn TODAY in terms of commercial value?
1.3 - What’s the most efficient way to learn DS / ML as a busy professional?
1.4 - How do I switch careers as quickly as possible?
1.5 - How do I build a portfolio of real-world projects?
When we surveyed 29,265 subscribers on our email list, one of the most common
questions was, “How do I get started in data science and machine learning?”
We’ve compiled this guide of FAQs to help you do just that… and much more. We hope
that you’ll use this guide to jumpstart your journey and cut the learning curve.
Let’s start with how to build a rock-solid foundation of practical skills and knowledge.
Then, later in this guide, we'll cover specific tips for people of various backgrounds.
To start:
1. Read the rest of this guide in its entirety. We surveyed 29,265 subscribers on
our email list, and these are the most common questions we’ve received.
Chances are that you have a few of these questions as well.
2. Circle back to the answer for the question, “What’s the most efficient way to
learn DS / ML as a busy professional?” In that answer, we outline what we’ve
found to be the most efficient roadmap for learning these skills.
3. Get your hands dirty immediately. We’ve prepared several tutorials for you to
get started, and we recommend diving into them ASAP. You can find the full list
of links and resources later, but here are a few important ones to look out for:
Throughout this guide, we’ll also have some external links to additional resources or
articles. We recommend reading through the complete guide first, and then checking
them out afterwards.
You’ve made an outstanding career decision to start learning more about DS & ML
(even if you decide it’s not for you). So without further ado, let’s keep going!
First, we’ll address the core skills that every data scientist needs. Then, we’ll
address those categories separately. There are also hybrid roles that require the
skills from both the business and the product side.
Finally, please note that we’re not trying to provide an exhaustive list of
everything you might run into. Instead, our goal is to list the core skills within
each category that will give you the biggest bang for your buck.
There are only 24 hours in a day... and you still need to sleep, eat, work, go to
school, and/or spend time with family and friends. So we’re going to introduce the
core skills that will get you a foot in the door.
And yes, some employers will have more requirements. But if you lock down the
following core skills, you WILL be able to land a high-paying job in this field,
guaranteed.
Examples include:
● Strategy - Using clustering to find “similar” test and control stores for a
chain-wide experiment
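To make the clustering example concrete, here is a minimal sketch using Scikit-Learn's KMeans. The store features, the numbers, and the choice of two clusters are all made up for illustration; a real project would use the chain's own data and would need to justify the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical store-level features: [weekly_sales, foot_traffic, store_size_sqft]
rng = np.random.default_rng(0)
small_stores = rng.normal([20_000, 1_500, 5_000], [2_000, 150, 500], size=(10, 3))
large_stores = rng.normal([80_000, 6_000, 20_000], [5_000, 400, 1_500], size=(10, 3))
stores = np.vstack([small_stores, large_stores])

# Scale the features so that no single one dominates the distance calculation.
scaled = StandardScaler().fit_transform(stores)

# Group the stores into clusters of "similar" stores.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Within each cluster, randomly split the stores into test and control groups,
# so that both groups contain comparable stores.
assignment = {}
for cluster_id in np.unique(labels):
    members = rng.permutation(np.flatnonzero(labels == cluster_id))
    half = len(members) // 2
    for store in members[:half]:
        assignment[int(store)] = "test"
    for store in members[half:]:
        assignment[int(store)] = "control"

print(assignment)
```

Splitting inside each cluster, rather than across the whole chain, is one simple way to keep the test and control groups balanced.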
Aspiring business data scientists should add the following core skills to their
skillset:
Product data scientists build ML / AI tools and software. They train models, build
prototypes, and integrate ML solutions into other parts of the software. For
product data scientists, the emphasis is on the product that you build.
Examples include:
Aspiring product data scientists should add the following core skills to their
skillset:
So let's make this question more interesting. Let's consider two more factors
aside from employability:
Considering these two factors, the clear winner is the Python programming
language. Python is the most popular language among data scientists, leading to
a wider range of opportunities. It's also famously intuitive and easy to learn.
Thus, our recommendations for tools to learn will all fall under the Python stack:
You can download all those libraries for free using the Anaconda distribution. We
are not affiliated with the authors of that distribution, but we use it for all of our
work as well.
Note: Download the latest version for Python 3.X. Python 2.X reached its end of
life in January 2020 and now survives mainly in legacy code. All of the major
libraries target Python 3.X, which is the standard going forward.
Academia favors this antiquated “bottom-up” approach... but it’s not very practical
for working professionals seeking a career transition. Not only is it long and
tedious, but you’ll also be more likely to lose motivation along the way.
You’ll start with tutorials instead of lectures. A tutorial teaches you how to do
something in as streamlined a way as possible. As you’ll notice, you won’t
understand how everything is working under the hood… yet.
However, if you follow the tutorial step-by-step, you should be able to see an
entire DS task from start to finish. This is invaluable for your learning journey!
Because when you start to see the big picture, you’ll understand how all the
moving pieces fit together.
After you complete a tutorial, it’s time to apply what you learned to new datasets.
This will allow you to solidify your skills and begin expanding your knowledge.
For example, when you try the same modeling process on a new dataset, you
might run into a new error. Upon googling the error, you might discover that it’s
because the dataset had a different format... or missing values... or mislabeled
classes... and so on. Now you can dig into that topic further and expand your
knowledge... within the context of what you’ve already learned.
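As a rough illustration of that kind of detective work, the sketch below runs three quick checks with Pandas on a small, made-up dataset (the column names and values are hypothetical). It looks at the format, the missing values, and the class labels.

```python
import pandas as pd

# Hypothetical "new" dataset with the kinds of problems a tutorial dataset may hide.
df = pd.DataFrame({
    "age": [34, None, 29, 41, None],
    "plan": ["basic", "Basic", "premium", "premium ", "basic"],
    "churned": ["yes", "no", "Yes", "no", "no"],
})

# Different format? Inspect the column types.
print(df.dtypes)

# Missing values? Count them per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Mislabeled classes? Look at the raw labels and how often each appears.
print(df["churned"].value_counts())

# One possible fix for inconsistent labels: strip whitespace and lowercase.
for col in ["plan", "churned"]:
    df[col] = df[col].str.strip().str.lower()

print(df["churned"].value_counts())
```

Each check points to a topic you can then dig into further, such as imputing missing values or standardizing categories.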
This technique of “learning in context” is one of the most powerful learning tools
that we’ve seen. It’s especially useful for busy professionals on a tight schedule.
Roadmap of Topics
Note: We’ll cover some of these in more detail throughout the rest of this guide.
What does this mean? Well… first, it means that you shouldn’t try to learn
everything about DS & ML. Instead, you should pick the closest goal posts and
execute against that target.
Target the core skills that we discussed in the previous question, “what do I need
to know in order to become a data scientist?”
Once you’ve learned the basics of those skills, don't expand the scope of your
studying (a common mistake we see). Instead, focus on showing those skills to
employers! Build a portfolio of real-world projects that you can point to and prove
your competency.
2. Pick out a dataset to start with. Choose a dataset that is in a domain you
might wish to enter and allows you to show your skills (i.e. no toy problems).
We’ve hand-picked some great datasets for you on this resource page.
4. Explore the data and make sure you understand the features. The first
step is to explore the data and make sure you understand it from an intuitive
perspective. Only then can you pose interesting questions to answer.
6. Clean the data, engineer features, and build your analytical base table
(ABT). The next step is to create an “analytical base table” from the original
dataset. Pre-processing the data allows you to pursue more interesting
objectives.
7. Complete your analysis / train your models. Once you’ve created your
analytical base table, you’ll have already done most of the heavy lifting. All
that’s left is to finish the analysis/modeling part of your project.
8. Write about your project directly inside your Jupyter notebook. Write a
detailed intro. Then, explain your data, describe your objective, and
summarize your results / key-takeaways. You can also write about how you'd
expand upon your project further.
10. Repeat steps (2) to (9) for a handful of other datasets and problems. Et
voilà, your portfolio is ready to go! Finally, link to your portfolio from your
resume, LinkedIn, and job board accounts.
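The explore, clean, engineer, and model steps above can be sketched in a few lines. This is only an outline under stated assumptions: the housing data, the column names, and the reference year are invented, and linear regression stands in for whatever model suits your project.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical raw data standing in for a real-world dataset.
raw = pd.DataFrame({
    "sqft": [850, 1200, None, 1500, 2000, 950, 1750, 1300],
    "year_built": [1990, 2005, 1985, 2010, 2015, 1978, 2001, 1995],
    "type": ["condo", "house", "condo", "house", "house", "condo", "house", "condo"],
    "price": [200, 310, 180, 400, 520, 190, 450, 300],  # in thousands
})

# Explore the data: summary statistics for the numeric features.
print(raw.describe())

# Clean the data: fill the missing square footage with the median.
abt = raw.copy()
abt["sqft"] = abt["sqft"].fillna(abt["sqft"].median())

# Engineer features: property age (reference year is an assumption) and
# one indicator column per property type.
REFERENCE_YEAR = 2020  # hypothetical
abt["age"] = REFERENCE_YEAR - abt["year_built"]
abt = pd.get_dummies(abt.drop(columns="year_built"), columns=["type"], dtype=float)

# Train a model on the analytical base table.
X = abt.drop(columns="price")
y = abt["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

With only eight rows the predictions mean little. The point is the shape of the workflow, which stays the same on a real dataset.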
Data Analysis
This one is fairly self-explanatory. Data analysis has existed in some form or
another since the ancient world. Ancient Roman armies would send Speculatores
and Exploratores ahead to scout and track enemy movements (i.e. collect
“data”). Then, military advisors would “analyze” that data and help the
commanders make more informed decisions.
Today, it’s the same idea—modernized. Software collects the data. Analysts
extract insights from it. And business leaders get to make more informed
decisions.
Data Science
In practice, data science leverages data analysis, applied machine learning, and
domain knowledge. It’s commercially-oriented. So it’s essential to develop your
domain expertise as a data scientist, and not only the technical skills.
Machine Learning
The key word there is “explicit.” For true machine learning, a computer must be
able to recognize patterns that it’s not explicitly programmed to identify. Machine
learning algorithms process data and build models from the patterns they
observe.
For more information, see the section titled “What makes machine learning so
special?” in our Bird’s Eye View of Applied Machine Learning.
Artificial Intelligence
Imagine you wanted to program a self-driving car and train the computer to know
what to do at a traffic light. Well, you could explicitly instruct the computer to
always stop at a red light, slow down at a yellow light, and go through at a green
light. In fact, this is how the AI of most computer games works: with a set of
specific instructions for various game states.
And yes, that would certainly be an attempt toward AI... but it wouldn’t be very
effective in the messy real world. There are so many “states” you might not have
accounted for.
For example, what if someone is still crossing the road when the light turns
green? What if the light is broken? What if it’s flashing yellow? What if the
light is not a traffic light but rather something else, such as police lights?
Machine learning, on the other hand, does not rely on explicit instructions for
each state. Instead, you’ll feed the computer as much relevant data as you can
gather. Then, one of many possible “algorithms” will build a “model” from that
data. That “model” will then be able to take a new input (captured by the camera)
and provide an output (instructions for the car) with a certain level of confidence.
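The contrast between explicit instructions and a learned model can be shown with a toy sketch. Everything here is a simplification: the two input features and the handful of observations are hypothetical, and a real self-driving system would learn from camera data, not two hand-written flags.

```python
from sklearn.tree import DecisionTreeClassifier

# Explicit-rules approach: one hard-coded instruction per anticipated state.
def rule_based_action(light_color):
    rules = {"red": "stop", "yellow": "slow", "green": "go"}
    return rules[light_color]  # raises KeyError for any state nobody anticipated

# The rule says "go" on green even if a pedestrian is still in the road.
print(rule_based_action("green"))

# ML approach: hypothetical observations of [light_is_green, pedestrian_in_road]
# paired with the action a careful driver took.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [0, 1]]
y = ["go", "stop", "stop", "stop", "go", "stop", "stop", "stop"]

# The algorithm builds a model from the data instead of from written rules.
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# New input: green light, pedestrian still crossing.
new_input = [[1, 1]]
action = model.predict(new_input)[0]
confidence = model.predict_proba(new_input).max()
print(action, confidence)
```

The model's output comes with a probability attached, which is the "level of confidence" described above.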
Deep Learning
Deep learning refers to a family of ML methods that deal with neural networks.
Neural networks usually need much more data to train than other ML methods.
Deep learning offers exceptional performance in some, but not all, domains. It
shines in domains like computer vision, natural language processing, and audio
processing.
If your goal is simply to land a high-paying job in data science, then you can do
so with very little math foundation... IF you learn how to apply the right tools, at
the right places, in the right way.
If you can prove your skills, then at least one great company out there will give
you a chance. Currently, the demand for DS skills vastly outpaces the supply...
so companies will NOT turn you away if you can prove your abilities.
Follow the top-down approach we outlined in “What’s the most efficient way to
learn DS / ML as a busy professional?” Then, focus on building a portfolio of
real-world projects.
After you master the DS / ML workflow, you can then dive into the theory to
supplement your practical skills.
If you wish to perform original research in ML and work on things like self-driving
cars, then you’ll need more math. Yet even so, our recommendation would still
be to pick the nearest goalpost and start with that. Follow the top-down approach
to get a foot in the door first. This will give you a professional environment to dive
further into the math and theory.
The biggest opportunities with DS & ML in the future will NOT lie in their
implementation, but rather in their application.
Today, data scientists will almost NEVER code an algorithm from scratch or
derive any sort of math formula. Instead, pre-existing implementations (like
Python’s Scikit-Learn library) have become the industry standard.
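To give a sense of how little code a pre-existing implementation requires, here is a short sketch that trains and evaluates a random forest with Scikit-Learn. The bundled iris dataset is used purely for convenience (it is a toy problem, not portfolio material), and the choice of model and parameters is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset (for illustration only).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The algorithm itself is already implemented; we only configure and fit it.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the held-out rows.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

No math was derived and no algorithm was written by hand. The effort goes into choosing the data, the question, and the evaluation.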
The technical skills will not be difficult to learn. Instead, the value that you can
add as a data scientist will come from your creativity and domain expertise.
In the business world, companies care about results. A data scientist who
leverages existing tools will outperform one who tries to do everything from
scratch.
Are married to specific methods | Know when & when NOT to use ML
Use data to prove their biases | Use data to correct their biases
We’ve seen...
They do so by:
(2) building a portfolio of projects that help them prove their real skills beyond a
shadow of a doubt.
Fact: DS skills are in heavy demand right now, with not enough supply. Fact: as
long as you can prove that you have these skills, someone will give you a shot.
1.) Pick the nearest goal post and get a foot in the door first.
A lot of people try to jump straight into the deep end. This is neither necessary nor
recommended for aspiring data scientists seeking entry-level positions. For
example, if you don’t have a technical background, don't start by aiming to
research neural nets at Google.
Even if that’s where you’d like to end up, it’s not the best target to start with.
Begin with the core skills of data analysis and applied machine learning. You’ll
get more mileage from these fundamental skills. They'll give you "marketability"
to get hired. Then, you can always learn the rest along the way.
2.) Start with a top-down approach, and don’t get lost in the weeds.
First of all, know that you can develop the technical skills fairly quickly by using a
“top-down” approach instead of a “bottom-up” approach, which lets you skip the
unnecessary parts of the theory. When you do, there will be HEAVY demand for
someone of your profile. For more info, see our answer to the question, “What’s
the most efficient way to learn DS / ML as a busy professional?”
Remember that data science is never done in a vacuum, and technical skills are
only one piece of the puzzle. The bottom line is that employers want to know if
you can use DS to help them make more money.
So emphasize your strengths. Show employers that you can spot opportunities.
Show them that you can connect DS/ML with tangible business value. You can
do so in two ways.
First, you should tailor your portfolio projects to highlight your domain expertise.
More on this in the next tip.
Second, during your interviews, you should always shift the conversation to
business value. Arrive prepared with ideas of how DS/ML can help the
employer's business.
The first step is to learn the core skills of applied ML, which we've covered
earlier. After you do so, you'll understand the capabilities and limitations of ML as
a technology. Combine this understanding with your previous experience... and
BOOM... you're now a candidate that employers will drool over.
Again, the first step is to learn the basics using tutorials. Then, hone your skills
on real-world datasets with commercial use cases. You’ll accumulate a portfolio
of real-world projects that you can use to get a foot in the door. This is especially
important for people coming from business backgrounds. It will prove your
technical competency and show your willingness to learn.
5.) Don’t limit your search to positions with “data scientist” in the job title.
This is especially true if your current position does not ask you to handle data or
do any form of analysis. Seek adjacent positions that will eventually allow you to
transition into a data scientist role.
1.) Focus on developing real skills that can drive business value.
Most employers will not care about your DS 101’s “final project” that has you
classifying kittens and dogs. Instead, seek real-world datasets with commercial
use cases. Hone your skills on those. These datasets are messier, more
ambiguous, and contain red herrings to filter out.
2.) Build a portfolio of real-world projects, not toy problems from school.
This is an extension of tip #1. As you tackle those real-world datasets, you can
build a portfolio of projects at the same time. You can do so by including
write-ups with detailed introductions and descriptions.
Complete them in Jupyter Notebooks and host the final notebook online. There
are a variety of free ways to do so (such as GitHub or Google Drive). You can
then link to your portfolio on your resume, LinkedIn, and job board profiles.
This is one of the best ways to stand out from the sea of applicants who can only
make empty claims.
The best way to land an internship is the same as landing a job. Prove that you
have real skills that can help a company make more money. Learn the skills,
build a portfolio, and apply to as many relevant positions as you can manage (it’s
a numbers game).
After you apply, prepare for the interview process. Review key concepts and
practice explaining projects in a clear and concise way.
4.) Don’t limit your search to positions with “data scientist” in the job title.
Also seek adjacent positions that will eventually allow you to transition into a data
scientist role. Great examples would be Data Analyst, Software Developer, Marketing
Analyst, Business Consultant, etc.
Each of these positions will give you invaluable work experience. At the same
time, they'll expose you to some of the skills necessary for DS, allowing you to
make up the rest on your own.
Many positions will claim they need X years of work experience. Think of that as
a “target” instead of a hard “cutoff.”
At the amusement park, for some rides you "must be this tall to ride.” But the job
market is different. At many places, the work experience "requirement" is more of
a preference. It's “we prefer you to be this tall to ride.” In other words, don’t be
discouraged.
As a student, time is on your side. You have more control over your time, so use
that to your advantage. Go all out in the numbers game. Full court press. Just
apply to as many relevant positions as possible... and let the opportunities filter
themselves.
1.) Data science is not only machine learning; analytical skills are crucial.
Software engineers often gravitate toward the machine learning side of data
science. It’s closer to their comfort zone. But to become a well-rounded data
scientist, analysis and domain expertise are vital.
Find a good dataset and read its description. Then, brainstorm a list of
compelling questions that the dataset might answer.
For example, let's say you find a dataset on school dropout rates. You might ask
questions such as:
Once you have a list of your questions, practice answering them! Try displaying
key statistics from the dataset... or plotting visualizations... or taking slices of the
data... or taking sums, averages, and so on.
Even if you discover that you can't answer the question, simply trying to do so will
sharpen your analytical skills.
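Here is a small sketch of what that practice might look like with Pandas. The school dropout data is invented for illustration, and the columns are assumptions; substitute whatever your dataset provides.

```python
import pandas as pd

# Hypothetical school-level data.
schools = pd.DataFrame({
    "district": ["north", "north", "south", "south", "east", "east"],
    "enrollment": [500, 800, 650, 300, 900, 400],
    "dropouts": [25, 24, 52, 30, 27, 20],
})
schools["dropout_rate"] = schools["dropouts"] / schools["enrollment"]

# Key statistics for every numeric column.
print(schools.describe())

# A slice of the data: only the larger schools.
large_schools = schools[schools["enrollment"] > 600]
print(large_schools)

# Sums and averages per district.
by_district = schools.groupby("district").agg(
    total_dropouts=("dropouts", "sum"),
    mean_rate=("dropout_rate", "mean"),
)
print(by_district)
```

If Matplotlib is installed, calling `by_district["mean_rate"].plot(kind="bar")` turns the last table into a quick visualization.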
We’ve seen many software engineers who want to transition into the field get
bogged down by the math. In reality, you probably need to know much less than
you think you do.
Go with the “top-down” approach we outlined earlier. Don’t feel pressured to lock
down all the math right from the start, as you can learn it as you go.
3.) Domain knowledge can help you stand out big time.
Many software engineers are already very strong in their technical skills... so one
of the best ways to stand out is to show your willingness to learn about the
domain.
For example...
You get the point. You can only connect DS with business value if you
understand the business you're in.
In general, software engineering is about making a plan and then executing on it.
You’ll map out the architecture, spec out the features you’ll need, and then come
up with a to-do list to execute against.
Yes, you’ll navigate with a framework (e.g. clean data → engineer features →
choose algorithms → train models). But you’ll often need to change your plan on
the fly as you uncover more insights from the data.
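One way to express that framework in code is a Scikit-Learn pipeline, sketched below on synthetic data. The specific steps are assumptions chosen for illustration: median imputation stands in for cleaning, scaling stands in for feature engineering, and two arbitrary candidate algorithms are compared with cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few missing values injected to simulate messiness.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[::20, 0] = np.nan

# Choose algorithms: a couple of candidates to compare.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Clean data -> prepare features -> train models, all inside one pipeline per candidate.
scores = {}
for name, algorithm in candidates.items():
    pipeline = make_pipeline(
        SimpleImputer(strategy="median"),  # clean: fill missing values
        StandardScaler(),                  # prepare: put features on one scale
        algorithm,                         # train: fit the candidate model
    )
    scores[name] = cross_val_score(pipeline, X, y, cv=5).mean()

best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Because each step is a swappable part of the pipeline, changing the plan on the fly usually means editing one line and rerunning the comparison.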
While some software engineers are great communicators, it’s usually not a big
part of the job. So it’s crucial that you practice explaining complex topics in clear
and concise ways. Our recommendation: grab a friend who knows nothing about
what you do... and then try to explain your job to them in plain English. (It works!)
Think of it this way: your lack of relevant work experience means that employers
will see you as a risky hire. So how can you mitigate that risk for them?
Well, step one is to develop the real skills capable of driving business
value. We’re not trying to “hoodwink” anyone here. You can cut the learning
curve by following the top-down approach we outlined earlier.
Then, once you’ve gotten the basics down, step two is to prove that you have
those skills. You don’t have the relevant work experience to back you up... so
what do you do? You build a portfolio of real-world projects.
We’ve repeated this point several times by now, but it’s really that simple. It’s all
about risk-mitigation for the employer. There’s no better tool for doing so than
having something tangible that you’ve built and can show.
How much? It depends on the tools you already work with. If you use R or Stata
on a daily basis, then you'll have a nice head start.
If you mostly use Excel, then you’ll want to add basic programming skills to your
repertoire.
You don’t need to get too fancy with it. Pick up the basics of a language like
Python (our choice) or R. Then, try to recreate some of the analyses you’re
already doing.
For example, you can try to replicate a pivot table using the Pandas library’s
groupby function. (You'll discover that it’s often much faster and easier with
Pandas when you get the hang of it.)
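As a sketch of that exercise, the snippet below rebuilds a simple "revenue by region and product" pivot table with groupby. The sales data is made up for illustration. It also shows Pandas' own pivot_table method, which produces the same result.

```python
import pandas as pd

# Hypothetical sales records, one row per transaction.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "east", "west"],
    "product": ["a", "b", "a", "b", "a", "a"],
    "revenue": [100, 150, 200, 50, 120, 80],
})

# groupby: sum revenue per (region, product), then spread products into columns.
by_region_product = (
    sales.groupby(["region", "product"])["revenue"].sum().unstack(fill_value=0)
)
print(by_region_product)

# The same table using pivot_table, which mirrors the spreadsheet feature.
pivot = sales.pivot_table(
    index="region", columns="product", values="revenue",
    aggfunc="sum", fill_value=0,
)
print(pivot)
```

Either form works. groupby tends to be the more flexible building block once you start chaining further steps.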
Remember, the key difference between data analysis and data science is the
addition of machine learning.
Data scientists will need to understand machine learning, regardless of the role.
Learning machine learning will also give you some great programming practice.
We recommend the Scikit-Learn library.
One of the best ways to transition into a data scientist is to start working more
like a data scientist. You can do so by proposing machine learning projects at
your workplace.
We’ve found that upper management are very receptive to the idea when you
frame it as an “experiment” or a “pilot” to expand your team’s capabilities.
While you might already work with data during your day job, it might be limited to
what your company has access to.
As you improve your applied ML skills, you should also expand the range of
datasets you can work with. We’ve handpicked some for you here: Datasets for
Data Science and Machine Learning.
Ok, we’ve mentioned this several times already, so we’re not going to beat a
dead horse. Just know that a portfolio of projects is one of the best things you
could create to help get your foot in the door.
As you expand your comfort zone with new datasets (tip #4), think about how you
can create full-length projects out of them. Then, host them online on a site like
GitHub. Complete your project inside Jupyter Notebook. It integrates nicely with
GitHub and also allows you to export your notebook as a web page.
4.1 What does the career path of a data scientist look like?
In general, there are two types of data science career paths, each with its own
appeal.
Path #1 - Leveling Up
The first is the level up path. Data science is a skill that you can continuously
level up, and your career will grow alongside. For example, here’s a sample path
from Data Science Intern to Director of Data Science:
Source: Indeed.com
As you can see, this is the more “straightforward” career path within data
science. As you get better, you’ll earn higher salaries, lead bigger projects, and
get more senior titles. Large players in the economy, from tech giants to Fortune
500 companies, are all hiring data scientists. When you’re ready for the next level, there
will be an opportunity awaiting you.
The second path is the choose your own adventure approach. The modern
economy is data-driven. Data science is not a buzzword—the ability to extract
actionable insights from data really does help companies make more profit.
So you don’t even need to continue “leveling up” as a data scientist if you don’t
want to.
● Finance & investing (banks and hedge funds are big employers in this
space)...
The point is that when you develop the skills, you’re not tied to the position
unless you prefer to be. Your data background will be in very high demand,
opening the door to opportunities that you otherwise would not have access to.
For learning purposes, you can choose to code a few of your favorite algorithms
from scratch. But we wouldn’t recommend sinking too much time optimizing your
code or worrying about the nitty gritty.
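If you do want to try this as a learning exercise, a from-scratch version can be quite short. The sketch below implements k-nearest neighbors in plain Python. The choice of algorithm, the sample points, and the labels are our own for illustration, and it makes no attempt at optimization.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points."""
    # Distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction = knn_predict(points, labels, query=(2.0, 2.0))
print(prediction)
```

Writing it once makes the idea stick. For real projects, Scikit-Learn's KNeighborsClassifier covers the same ground with far more care.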
Advantages | Disadvantages
Can learn how the algo works under the hood | Higher math & programming requirements
Advantages | Disadvantages
Much more commercial demand | Cannot see each step of the algo
4.3 How can I stay abreast with the latest tools and best
practices given the rapid pace of this industry?
We have some pretty unconventional advice for this one. The usual advice is to
follow industry publications, blogs, and conferences. Of course, that advice WILL
work. It’s applicable to almost every field, including medicine, engineering, sales,
and so on.
But as you’ve probably gathered by now, our approach is to try to get as much
hands-on experience as possible. So we recommend the following:
Implementation requires you to pursue the “best technology.” But here's the
problem. There are already very potent pre-existing libraries (e.g. Scikit-Learn).
Cloud-based solutions (e.g. AWS ML) are also being actively developed.
In the future, ML and DS might even be more automated, with platforms that
handle much of the DS/ML workflow. In other words, if you focus on
implementation, you’ll be in a race that you simply can't win.
But you know what can't be easily automated? The application of these
technologies to drive real-world business value.
5. Empathy - the ability to understand how your solutions will affect real
people... and how to create win-win scenarios (i.e. expanding the pie
instead of stealing the pie).
According to RemoteOK.io, the median salary for remote data scientists at the
time of this writing is $88,750 USD. That is a very healthy income in any part of the
world, and it can be a life-changing salary in some.
We have hired from some of the following platforms, and we’ve heard great
things about the others.
Freelancing
● UpWork
● Toptal
● AngelList
● FlexJobs
● RemoteOK.io
● ZipRecruiter
● SimplyHired
- - - * - - -
And that wraps up the EDS Career Guide! Visit EliteDataScience.com for more...
● Guides
● Concept Explainers
● Code Tutorials
● Career Guidance
● Tools & Resources