How To Become A Data Scientist
A Step By Step Guide
Foreword
Data science is a dynamic and growing field that lies at the crossroads of other fields like
statistics, computer science, and business management. In this book, we explore the most
basic and burning question asked by those looking to make a career in data science - how
do I become a data scientist?
The book is divided into ten chapters. The first chapter defines data science and traces its
origins. The second chapter describes data scientists. It tells you who they are, and what
they do. The third chapter provides a case study of data science at LinkedIn. It was introduced
and implemented by Jonathan Goldman, a physicist from Stanford, who used data to make
the social networking website popular among professionals. Chapter Four breaks down
the data science approach to solving problems into eight distinct and easy-to-follow steps.
Chapter Five is the heart of the book. It tells you how to become a data scientist by taking
you through everything you need to know about six of its core components.
Chapter Six outlines the top ten machine learning algorithms. Chapter Seven discusses the
most popular jobs in the field. Chapter Eight maps the scope of and opportunities in data
science. Chapter Nine provides a glossary of key terms. And lastly, Chapter Ten summarizes
the key points made in this book to set you off on your exciting data science journey.
Vikalp Jain
President, AcadGild
Jan, 2018
Bangalore
Chapter-1
[Figure: data science at the center of related tools and skills: Tableau, Hadoop, QlikView, Spark, SAS VA, Hive, Excel, SQL, and domain expertise]
Data science is a dynamic and growing field that lies at the crossroads of other fields like
statistics, computer science, and business management. It refers to processes and methods
that help us make sense of large volumes of data for organizational purposes. Although it
is an amalgamation of many disciplines, it does not draw from each of them equally or in
fixed proportions. Data science draws chiefly from statistics and computer science. Statistics
provides the framework to explore data, find its significant features, and communicate it
visually. Computer science provides the technological support required to process and
extract knowledge from large data sets.
Data science is often thought of as a new field of study. However, its origins can be traced
back to the time of the digital revolution (between the 1950s and 1970s), when technology
significantly altered the way humans interacted and socialized. In 1962, John W. Tukey
described this change in his visionary article, “The Future of Data Analysis”. In it, he envisioned
data analysis as a mode of scientific inquiry that was intrinsically empirical and potentially
beneficial to all fields of science and technology. It wasn’t until the end of the first decade
of the new millennium, however, that the term “data scientist” was coined. It was first popularized
in 2008 by DJ Patil of LinkedIn and Jeff Hammerbacher of Facebook. In the next three
years, the number of job listings for “data scientist” skyrocketed; the listings increased by a
staggering 15,000%.
Chapter-2
The job of a data scientist has been labelled as the “sexiest job of the 21st century”
by Harvard Business Review. But what does this job entail? Data scientists work with large
quantities of structured and unstructured data. Structured data refers to organized information
that is easily accessible. Unstructured data, on the other hand, is less organized. The lack of
structure makes compiling and interpreting this form of data a messy and tedious task.
The challenge of the modern world is to keep up with seemingly infinite volumes of
ever-changing types of data. The data scientist’s job is to help decision makers interact
with and interpret data for specific purposes.
A data scientist is driven by the desire to uncover the underlying principles governing a
data set. He likes to solve problems, and can make accurate associations between disparate
or incomplete data sets. The data scientist is usually a master communicator. Not only is
he proficient in programming languages, but also in verbal and visual languages that help
him be an interpreter and communicator of data. In short, the data scientist is a hacker,
an analyst, a communicator, and an adviser, all wrapped in one.
Chapter-3
Jonathan Goldman started working for LinkedIn in June 2006. The social networking
website was growing well and had close to 8 million users at the time. Despite the growing
number of users, however, something was missing. Professionals weren’t networking as
much as the executives at LinkedIn wanted. One manager likened the experience of the
website to attending a conference reception where you didn’t know anyone.
The name and logo of LinkedIn are registered trademarks of the company. Their use in this book does not imply
any affiliation with, or endorsement by, LinkedIn.
Goldman held a PhD in Physics from Stanford. He was curious and possessed a bent for
analytics. He remained focused on the networking problem, and observed how users connected.
Soon he was able to gather insights. His ideas were met with skepticism at the start. But
Reid Hoffman – the company’s co-founder and then-CEO – backed him and encouraged
him to wield the magic of analytics. Hoffman had experienced success with analytics in
the past at PayPal. He gave Goldman a great deal of autonomy and freedom to test his
ideas in the form of ads on the website’s most popular pages. The rest, as they say, is
history.
Goldman’s ads, which tried to guess a user’s network, worked brilliantly. They had
click-through rates like the company had never seen. “People You May Know” ads became
a regular feature on the website. Goldman refined his suggestions using predictive
models like “triangle closing”. The model recommended John to Sue if they had many
mutual friends. Other factors that predicted connections included overlapping tenures at schools and
workplaces. The feature gave LinkedIn millions of new pageviews and made it a great platform for
professional networking.
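The “triangle closing” idea can be sketched in a few lines of Python. This is only an illustration of ranking people by mutual connections, not LinkedIn's actual model; the network data and the function name `suggest_connections` are made up for the example.

```python
from collections import Counter

def suggest_connections(network, user, top_n=3):
    """Rank people the user is not yet connected to by number of mutual connections."""
    direct = network.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in network.get(friend, set()):
            # Skip the user and anyone already connected to the user.
            if candidate != user and candidate not in direct:
                mutual_counts[candidate] += 1
    return mutual_counts.most_common(top_n)

# Hypothetical network: each person maps to the set of their connections.
network = {
    "Sue": {"Ann", "Bob", "Cat"},
    "John": {"Ann", "Bob", "Cat"},
    "Dan": {"Ann"},
    "Ann": {"Sue", "John", "Dan"},
    "Bob": {"Sue", "John"},
    "Cat": {"Sue", "John"},
}
suggestions = suggest_connections(network, "Sue")
print(suggestions)  # [('John', 3), ('Dan', 1)]
```

John shares three connections with Sue, so he closes the most triangles and is suggested first.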
The case study used in this chapter has been taken from the article ‘Data Scientist: The Sexiest Job of the 21st Century’,
which was published in the October 2012 issue of the Harvard Business Review.
Chapter-4
[Figure: the data science cycle, closing with Present Findings, Make Decisions, and Refine Findings, connected by a feedback loop]
Data science is a set of processes that seek to gather, analyze, interpret, and present data
in meaningful ways. These processes come together to make what I like to refer to as the
‘Data Science Way’ of solving problems. The way comes full circle, as every problem leads
to a new discovery that throws up new problems. Ultimately, the data science way is a
continuous process of discovery and re-discovery, and of new insights and challenges in
the wake of those insights. The following are the steps that make up the data science way:
Data Collection
Identify what data will be required to solve the business
problems defined in the step above. Once you have
identified the data requirements, figure out how to
access this data. You might need to connect to an internal
database or use APIs to pull data from third-party sources.
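As a rough sketch of pulling data from an internal database, the example below uses Python's built-in sqlite3 module. The in-memory database, the `orders` table, and its rows are stand-ins invented for illustration; a real project would connect to whatever system the organization actually uses.

```python
import sqlite3

# An in-memory database stands in for an internal company database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)

# Pull only what the business question needs: total revenue per customer.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
print(rows)  # [('alice', 150.0), ('bob', 75.5)]
```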
Model Data
Once you have the clean and relevant data, you start
correlating it with the business problem defined in Step
2 and make recommendations based on your findings.
In this step, your statistical and machine learning (ML)
skills come in handy for building models that predict
business outcomes and provide recommendations.
However, statistical and ML skills alone are not enough;
data scientists must understand the business well
enough to know whether the results of the models are
meaningful and relevant.
Present Findings
Share your findings with others so that solutions can
be implemented. Make the best use of visual media to
communicate aesthetically, and rely on the precision of
verbal language to communicate all insights clearly.
Refine Findings
The last step is to refine your findings as much as possible
by repeating the processes. New data could help validate
your findings or modify them according to changing trends.
This step helps keep your operations up to date
with changing times.
Chapter-5
How to Become a Data Scientist
A good data scientist must master the six most essential and broad components of data
science – statistics, programming, big data, data visualization, machine learning, and
business acumen. The following guide has been designed to set you off on an enriching
journey in this field. It outlines what you need to know to become a proficient data scientist.
Basic Statistics
Statistics is a broad field that deals with the collection, analysis,
interpretation, presentation, and organization of data.
Thus, it isn’t surprising that all data analytics algorithms
use statistical principles for data analysis. The process
requires at least a basic understanding of descriptive
statistics and probability theory.
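To give a feel for descriptive statistics, here is a small sketch using Python's standard statistics module. The `daily_sales` figures are invented for the example, and the last line is just an observed share of days, a very simple stand-in for a probability estimate.

```python
import statistics

# Hypothetical daily sales counts for one week.
daily_sales = [12, 15, 11, 15, 20, 18, 14]

mean_sales = statistics.mean(daily_sales)      # central tendency
median_sales = statistics.median(daily_sales)  # middle value when sorted
mode_sales = statistics.mode(daily_sales)      # most frequent value
spread = statistics.stdev(daily_sales)         # sample standard deviation

# Share of days with at least 15 sales, an empirical probability.
share_busy_days = sum(s >= 15 for s in daily_sales) / len(daily_sales)

print(mean_sales, median_sales, mode_sales, round(spread, 2), round(share_busy_days, 2))
```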
Programming Languages
Programming languages help data scientists design
tools for data analysis. Python and R are two programming
languages that data scientists use widely.
1. Python Programming
Python, a general-purpose programming language, was ranked
the top programming language of 2017 by IEEE Spectrum,
and for good reason. It is fast becoming the most popular
language among data scientists. Python lets you work fast,
is flexible, and uses elegant syntax that is easy to learn. It
also has an extensive collection of libraries that make it a superb
tool for analytics.
2. R Programming
R is a language and environment for statistical computing
and graphics. It is a GNU project similar to S, which was
developed at Bell Laboratories. Much code written for S runs
unaltered in R. The open-source platform offers many features such as linear
and nonlinear modelling, time-series analysis, etc. These
features are useful for statistical analysis and representation.
It runs on several platforms and systems like FreeBSD,
Linux, Windows and macOS, and is free software under
the terms of the GNU General Public License. To learn R, sign up for
AcadGild’s course on Data Analytics.
1. Hadoop
Apache Hadoop allows data scientists to store and process
large amounts of data quickly and easily. It uses a distributed
file system to speed up computing and reduce the risk of
failure. If one of the nodes is down, jobs are sent to other
nodes so that the data processing doesn’t stop. The software
is Java-based, and free. It’s an important tool that helps you
easily scale up your data computing capability.
2. Spark
Apache Spark is another type of software used for data
processing. It is used by companies like Netflix, Yahoo,
and eBay on a massive scale. Spark’s open-source community
has over 1,000 contributors from 250+ organizations. It is
fast and has set a world record for large-scale, on-disk
data sorting. What’s more, it is easy to use and comes
with high-level libraries that include support for SQL queries,
machine learning and graph processing. Spark greatly
increases developer productivity by seamlessly integrating
complex workflows.
Supervised machine learning is enabled by algorithms that use a labelled sample data set
to learn how to predict outcomes for new data. Unsupervised algorithms, on the other
hand, do not have the privilege of labelled examples to learn from and must find structure
in the data on their own. Clustering algorithms are good examples of unsupervised machine learning.
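The contrast can be sketched with two toy functions. Both are deliberately simplistic stand-ins chosen for illustration, not standard library algorithms: the first learns from labelled samples, the second receives no labels and simply splits the data at its widest gap.

```python
def train_nearest_mean(samples):
    """Supervised: learn one average value per label from labelled samples."""
    grouped = {}
    for value, label in samples:
        grouped.setdefault(label, []).append(value)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}

def predict(model, value):
    """Assign the label whose learned average lies closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))

def split_at_largest_gap(values):
    """Unsupervised: no labels given, so group points around the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

# Hypothetical labelled data for the supervised case.
labelled = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]
model = train_nearest_mean(labelled)
print(predict(model, 8.0))                          # high
print(split_at_largest_gap([10.0, 1.0, 9.0, 2.0]))  # ([1.0, 2.0], [9.0, 10.0])
```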
Deep learning is a subset of machine learning. Essentially, it’s a family of algorithms, built on
multi-layered neural networks, that can take in and process large volumes of input data and still
produce meaningful output. What separates deep learning from other forms of algorithms is its ability
to automatically extract features from input data.
To sum up, machine learning falls under artificial intelligence. All machine learning
is artificial intelligence, but not all artificial intelligence is machine learning. Deep
learning is a subset of machine learning that identifies features of input data automatically.
(You will learn ten of the top machine learning algorithms in the next
chapter.)
Business Acumen
Business acumen is a key component of data science
because it provides the context for all data science
endeavors. Without an understanding of how businesses
– and, more specifically, domains – function, the data
scientist would not know how to generate key insights,
or what to do with them. The data scientist must be willing
to learn from key stakeholders, and constantly strive to
improve his understanding of the following aspects of
business:
1. Marketing
Data scientists can help marketers use data to test the
viability of products, to gain critical insights about customer
segments, their psychology, or to simply learn what sells.
2. Operations
Data scientists work across different departments and
boards of any organization. Hence, they must have some
sense of how these units operate and coordinate.
3. Communication
The data scientist must be a master communicator. He
should be able to communicate clearly and precisely what
the data reveals, and what it means to a varied audience,
including computers.
Chapter-6
Machines are expected to automate about 25% of jobs across the globe in the next ten
years. The number signifies the growing importance of algorithms that enable machines
to learn and perform a variety of tasks – from simple to complex – for different purposes.
Here is our pick of the top ten machine learning algorithms that a data scientist should
know.
2. K-Means Clustering
This algorithm groups similar-seeming data into distinct clusters. It is useful for programs
like search engines that can throw up numerous results for any search term. For example,
a search for “uber” could potentially display results for the taxi service company, food that
the same company delivers, or quite simply dictionaries that define the meaning of the
word. Using this algorithm, a search engine can display all pages on Uber cabs once it
figures out you’re looking for information about the taxi service.
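A minimal sketch of the clustering loop is shown below. It is a bare-bones version written for illustration: the points are invented, and seeding the centroids with the first k points is a simplifying assumption (real implementations usually pick starting points more carefully).

```python
def k_means(points, k, iterations=10):
    """Cluster 2-D points; the first k points seed the centroids (an assumption)."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data with two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(9, 9), (8, 9), (9, 8)]
```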
4. Apriori
This algorithm mines past transactions for items that frequently occur together and uses
those patterns to anticipate future behavior. E-commerce
websites use it to recommend products based on a customer’s purchasing history.
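The core idea behind Apriori, counting how often items appear together and discarding rare items early, can be sketched as follows. This simplified version stops at pairs, and the baskets and the `min_support` threshold are made up for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return itemsets of size 1 and 2 found in at least min_support baskets."""
    def support(itemset):
        return sum(1 for basket in transactions if itemset <= basket)

    items = sorted({item for basket in transactions for item in basket})
    singles = {frozenset([i]): support(frozenset([i])) for i in items}
    frequent = {s: n for s, n in singles.items() if n >= min_support}

    # Apriori pruning: a pair can only be frequent if both of its items are.
    survivors = sorted(item for s in frequent for item in s)
    for pair in combinations(survivors, 2):
        candidate = frozenset(pair)
        count = support(candidate)
        if count >= min_support:
            frequent[candidate] = count
    return frequent

# Hypothetical purchase histories.
baskets = [
    {"phone", "case"},
    {"phone", "case", "charger"},
    {"phone", "charger"},
    {"laptop"},
]
result = frequent_itemsets(baskets, min_support=2)
print(result[frozenset({"phone", "case"})])  # 2
```

A shop could use the frequent pair of phone and case to recommend a case to anyone who buys a phone.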
5. Logistic Regression
This type of algorithm is like the linear regression type. Both are predictive and correlate
variables. The difference, however, is that logistic regression estimates the probability of a
categorical outcome (such as yes or no), while linear regression predicts a continuous numeric value.
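Here is a small sketch of logistic regression with one input, fitted by plain gradient descent. The data (hours studied versus pass or fail), the learning rate, and the number of epochs are all assumptions picked for illustration.

```python
import math

def sigmoid(z):
    """Squash any number into the range 0 to 1 so it reads as a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(inputs, labels, learning_rate=0.1, epochs=5000):
    """Fit a weight and bias by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(inputs, labels)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, inputs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(hours, passed)

p_low = sigmoid(w * 1.5 + b)   # estimated probability of passing after 1.5 hours
p_high = sigmoid(w * 5.5 + b)  # estimated probability of passing after 5.5 hours
print(round(p_low, 3), round(p_high, 3))
```

The output is a probability for each input rather than a single predicted quantity, which is what sets this model apart from linear regression.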
6. Linear Regression
As explained in the section on statistics, linear regression is used to identify the relationship
between dependent and independent variables. It is used to explain changes in y, the
dependent variable, by tracing them back to changes in x, the independent variable. For
instance, if an increase in investment in advertising results in a proportionate increase in
revenue, the algorithm will suggest higher investment in advertising to increase revenue.
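The advertising example can be worked through with ordinary least squares. The figures below are invented, with advertising spend as the independent variable and revenue as the dependent variable.

```python
def fit_line(predictor, response):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(predictor)
    mean_p = sum(predictor) / n
    mean_r = sum(response) / n
    covariance = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictor, response))
    variance = sum((p - mean_p) ** 2 for p in predictor)
    slope = covariance / variance
    intercept = mean_r - slope * mean_p
    return slope, intercept

# Hypothetical figures, in thousands of dollars.
ad_spend = [10, 20, 30, 40]
revenue = [25, 45, 65, 85]
slope, intercept = fit_line(ad_spend, revenue)
print(slope, intercept)  # 2.0 5.0
```

A slope of 2.0 says each extra unit of advertising spend goes with two extra units of revenue in this made-up data set.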
8. Decision Trees
This type of algorithm is used to classify information and predict all possible outcomes
according to classifications. For example, the answer to the question “Are you a data
scientist?” could either be yes or no. If the answer is yes, we can use this algorithm to list
all possible tasks the data scientist engages in to find out what tasks are most popular.
If the answer is no, the algorithm could present a list of other occupations to determine
what the individual does for a living.
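A decision tree can be pictured as nested yes/no questions. The sketch below walks a small hand-written tree; the questions and job labels are hypothetical, and a real algorithm would learn the tree from data rather than have it written out by hand.

```python
# Hypothetical hand-built tree: each branch is either another question or a final label.
tree = {
    "question": "writes_code",
    "yes": {
        "question": "builds_models",
        "yes": "data scientist",
        "no": "software engineer",
    },
    "no": "business analyst",
}

def classify(node, answers):
    """Walk from the root to a leaf, following one yes/no answer per question."""
    while isinstance(node, dict):
        branch = "yes" if answers[node["question"]] else "no"
        node = node[branch]
    return node

print(classify(tree, {"writes_code": True, "builds_models": True}))  # data scientist
print(classify(tree, {"writes_code": False}))                        # business analyst
```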
9. Random Forests
Many decision trees combine to form random forests. Random forests are ensemble
algorithms that pool the predictions of many decision trees to classify and correlate more information and
predict more outcomes with greater accuracy.
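The pooling step can be sketched as a majority vote. This example shows only the voting: the three one-question "trees" are hand-written stand-ins with hypothetical names, whereas a real random forest would train each tree on a random sample of the data and features.

```python
from collections import Counter

# Three hand-written, one-question trees stand in for trained trees (assumption).
def tree_models(person):
    return "data scientist" if person["builds_models"] else "other"

def tree_code(person):
    return "data scientist" if person["writes_code"] else "other"

def tree_stats(person):
    return "data scientist" if person["knows_statistics"] else "other"

def forest_predict(trees, person):
    """Ask every tree for a prediction and return the most common answer."""
    votes = Counter(tree(person) for tree in trees)
    return votes.most_common(1)[0][0]

forest = [tree_models, tree_code, tree_stats]
candidate = {"builds_models": True, "writes_code": True, "knows_statistics": False}
prediction = forest_predict(forest, candidate)
print(prediction)  # data scientist
```

Two of the three trees vote "data scientist", so the forest overrules the single dissenting tree.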
[Figure: machine learning divided into supervised learning and unsupervised learning, with example techniques including discriminant analysis, SVR, GPR, ensemble methods, naive Bayes, hierarchical clustering, and Gaussian mixture models]
Chapter-7
[Figure: average annual salary by role. Data analyst (DA): $62,379; business analyst (BA): $65,991; statistician (ST): $75,069; data and analytics manager (DAM): $116,725; data scientist: $118,709]
Data science is inter-disciplinary and draws from many fields like statistics, mathematics,
computer science, and business management to collect, organize, analyze, and interpret
data. The task and object of this science are novel and challenging. It requires a variety of skill
sets. Hence, data science teams in organizations are generally made up of professionals with
different backgrounds and profiles. The most popular jobs in data science are as follows:
Data Analysts
They are the detectives that specialize in the analysis of data. The primary task of a data
analyst is to dissect and interpret data in meaningful ways for organizations. With their
specialized focus, they aid statisticians and business analysts to run the grand theatre of
data science productively. The average data analyst makes about $62,000 per year.
Business Analysts
Much like data analysts, business analysts are specialists with curious minds inclined to
perform analyses. They typically solve problems. While the data analyst is focused on
problems with data, the business analyst contributes domain knowledge and business
acumen to solve management and operational problems. The average business analyst
makes around $65,000 per year.
Statisticians
The science of data cannot do without statisticians, of course. They are the original data
scientists, and continue to play an active role in this dynamic field. With advancements in
technologies and support from other specialists (like the data and business analysts),
statisticians can now generate more and better insights from larger and more complex
data sets. The statistician makes $75,000 per year on average.
Data and Analytics Managers
Data and analytics managers decide priorities, manage teams, and ensure that targets
are met. They are the guides that lead the data science journey. For this reason,
they are paid well, around $116,000 per year on average.
Data Scientists
This is arguably one of the most popular job titles in the market. Good data scientists are rare,
and in extremely high demand. They are adept at all the aspects of data science that have
been discussed in this book. They can handle data efficiently and communicate it intelligently.
Additionally, they possess domain and business knowledge that makes them
indispensable to organizations that hire them. The data scientist makes the most among
all data professionals. On average, a data scientist earns about $118,000 per year.
The information presented in this chapter has been taken from KDnuggets’ article on ‘Salaries by Roles in Data Science and Business
Intelligence’, and other market sources.
Chapter-8
Data science is relevant for all industries. Hence, it is being implemented across sectors at
an astounding rate. The demand for data scientists has soared through the roof, while the
supply has lagged behind. An increasing number of universities and colleges
are now nurturing and producing data scientists. The advent of e-learning platforms has also
contributed greatly to the supply. Despite the increasing number of data professionals,
however, there remains a shortage due to the high demand for data scientists. In 2017,
Glassdoor ranked data scientist the “best job in America” for the second year running. And CareerCast
listed it as one of the “toughest jobs to fill”. There is no doubt that this is one of the most
flourishing career paths right now – and perhaps, as HBR suggested, the sexiest job in the
market.
Here are some facts and figures on the booming field of data science:
By 2025, the sum of all digital data on earth is expected to surpass 1600
trillion gigabytes.
By 2020, every human being on earth will create around 1.5 megabytes of
data per second.
48.4% of the firms surveyed by HBR in 2017 reported that they were
gaining measurable returns on data science investments.
A company in the Fortune 1000 can rake in as much as $65 million with just a
10% increase in data accessibility.
IBM expects the demand for data scientists to increase 28 percent by 2020.
According to the IDC, the revenue from data science is expected to rise
from roughly $130 billion in 2016 to $200 billion by 2020.
Chapter-9
Big Data refers to large, complex volumes of data that require advanced
analytics for interpretation.
Deep Analytics is the kind of analytics that helps interpret events and
outcomes in great depth. It is typically descriptive in nature.
Exploratory Analysis is the step in the data science journey that seeks to
formulate hypotheses. Visualization is an important part of this step.
Chapter-10
Conclusion
Data science refers to those processes and methods that help make sense of large
volumes of data for organizational purposes. Its origins can be traced back to the time of
the digital revolution (between the 1950s and 1970s), when technology significantly
altered the way humans interacted and socialized.
The job of the data scientist has been labelled as the “sexiest job of the 21st century” by
Harvard Business Review. Data scientists are highly appreciated because they are proficient in
many trades. The data scientist is a hacker, an analyst, a communicator, and an adviser,
all in one. The ideal data scientist is well-versed in six core components of the science:
basic statistics, programming languages, big data technologies, data visualization tools,
machine learning, and business acumen.
Data scientists are problem solvers. They are scientists who set clear goals to be achieved,
ask basic questions that help uncover problems, find data that can provide answers,
explore possibilities in interpretation, identify key features and findings, communicate
them for use, and never stop refining what they find.
Data scientists wear many hats in organizations and work under a variety of designations.
On average, data science jobs pay anywhere between $62,000 and $118,000 annually.
Data scientists are in high demand due to a shortage of data science professionals in the market, and
the increasing need for their skills across sectors. This book was put together to set aspiring
data scientists on a novel, exciting and fruitful journey in data science.