THE FIELD GUIDE to DATA SCIENCE
SECOND EDITION

›› THE STORY of THE FIELD GUIDE
Several years ago we created The Field Guide to Data Science because
we wanted to help organizations of all types and sizes. There were
countless industry and academic publications describing what Data
Science is and why we should care, but very little information was
available to explain how to make use of data as a resource. We find
that situation to be just as true today as it was two years ago, when
we created the first edition of the field guide.
This field guide came from the passion our team feels for its
work. It is not a textbook nor is it a superficial treatment.
Senior leaders will walk away with a deeper understanding of
the concepts at the heart of Data Science. Practitioners will
add to their toolbox. We hope everyone will enjoy the journey.
›› WE ARE ALL AUTHORS of THIS STORY
We recognize that Data Science is a team sport. The Field Guide
to Data Science provides Booz Allen Hamilton’s perspective on the
complex and sometimes mysterious field of Data Science. We cannot
capture all that is Data Science. Nor can we keep up - the pace at
which this field progresses outdates work as fast as it is produced.
As a result, we opened this field guide to the world as a living
document to bend and grow with technology, expertise, and
evolving techniques.
Thank you to all the people that have emailed us your ideas as
well as the 100+ people who have watched, starred, or forked our
GitHub repository. We truly value the input of the community, as
we work together to advance the science and art of Data Science.
This is why we have included authors from outside Booz Allen
Hamilton on this second edition of The Field Guide to Data Science.
If you find the guide to be useful, neat, or even lacking, then
we encourage you to add your expertise.
We hope you will all continue to find value from The Field
Guide to Data Science and to share in our excitement around the
release of this second edition. Please continue to be part of the
conversation and take this journey with us.
›› THE OUTLINE of OUR STORY
›› Meet Your Guides
›› Take off the Training Wheels – The Practitioner's Guide to Data Science
Guiding Principles
The Importance of Reason
Component Parts of Data Science
Fractal Analytic Model
The Analytic Selection Process
Guide to Analytic Selection
Detailed Table of Analytics
Data Science is a field that is evolving at a very rapid pace…be part of the journey.

Leading our Data Science team shows me every day the incredible power of discovery and human curiosity. Don't be afraid to blend art and science to advance your own view of data analytics – it can be a powerful mixture.

Data Science is the most fascinating blend of art and math and code and sweat and tears. It can take you to the highest heights and the lowest depths in an instant, but it is the only way we will be able to understand and describe the why.

Data Science is about asking bigger questions, seeing future possibilities, and creating outcomes you desire.

Invest your time and energy in data that is difficult to assemble. If it doesn't exist, find a way to make it exist.

The power of data science lies in the execution.
Steven Mills (@stevndmills)
Data Science truly can change the world.
Begin every new data challenge with deep curiosity along with a healthy dose of skepticism.
In the jungle of data, don't miss the forest for the trees, or the trees for the forest.

Alex Cosmas (@boozallen)
Data scientists should be truth-seekers, not fact-seekers.
Focus on value, not volume.
The beauty of data science lies in satisfying curiosities about important problems by playing with data and algorithms.

Brian Keller (@boozallen)
Grit will get you farther than talent.
Don't forget to play. Play with tools, play with data, and play with algorithms. You just might discover something that will help you solve that next nagging problem.
Data science is both an art and science.
›› COMMUNITY CONTRIBUTORS
Two roads diverged in a wood, and I—
I took the one in the direction of the negative gradient,
And that has made all the difference.

End every analysis with… 'and therefore.'

Data Science is about formally analyzing everything around you and becoming data driven.

Armen Kherlopian (@akherlopian)
Data Science Defined
Data Science is the art of turning data into actions. This is
accomplished through the creation of data products, which provide
actionable information without exposing decision makers to the
underlying data or analytics (e.g., buy/sell strategies for financial
instruments, a set of actions to improve product yield, or steps to
improve product marketing).
Performing Data Science requires the extraction of timely, actionable information from diverse data sources to drive data products. Examples of data products include answers to questions such as: "Which of my products should I advertise more heavily to increase profit? How can I improve my compliance program, while reducing costs? What manufacturing process change will allow me to build a better product?" The key to answering these questions is: understand the data you have and what the data inductively tells you.

»» Data Product
A data product provides actionable information without exposing decision makers to the underlying data or analytics. Examples include:
• Movie Recommendations
• Weather Forecasts
• Stock Market Predictions
• Production Process Improvements
• Health Diagnosis
• Flu Trend Predictions
• Targeted Advertising

Read this for additional background:

The term Data Science appeared in the computer science literature throughout the 1960s-1980s. It was not until the late 1990s, however, that the field as we describe it here began to emerge from the statistics and data mining communities (e.g., [2] and [3]). Data Science was first introduced as an independent discipline in 2001.[4] Since that time, there have been countless articles advancing the discipline, culminating with Data Scientist being declared the sexiest job of the 21st century.[5]

We established our first Data Science team at Booz Allen in 2010. It began as a natural extension of our Business Intelligence and cloud infrastructure development work. We saw the need for a new approach to distill value from our clients' data. We approached the problem with a multidisciplinary team of computer scientists, mathematicians and domain experts. They immediately produced new insights and analysis paths, solidifying the validity of the approach. Since that time, our Data Science team has grown to 250 staff supporting dozens of clients across a variety of domains. This breadth of experience provides a unique perspective on the conceptual models, tradecraft, processes and culture of Data Science.
The differences between Data Science and traditional analytic
approaches do not end at seamless shifting between deductive
and inductive reasoning. Data Science offers a distinctly different
perspective than capabilities such as Business Intelligence. Data
Science should not replace Business Intelligence functions within
an organization, however. The two capabilities are additive and
complementary, each offering a necessary view of business operations
and the operating environment. The figure, Business Intelligence and
Data Science – A Comparison, highlights the differences between the
two capabilities. Key contrasts include:
›› Discovery vs. Pre-canned Questions: Data Science actually
works on discovering the question to ask as opposed to just
asking it.
›› Power of Many vs. Ability of One: An entire team provides
a common forum for pulling together computer science,
mathematics and domain expertise.
›› Prospective vs. Retrospective: Data Science is focused on
obtaining actionable information from data as opposed to
reporting historical facts.
Business Intelligence and Data Science - A Comparison (adapted in part from [6])
The way organizations make decisions has been evolving for half a
century. Before the introduction of Business Intelligence, the only
options were gut instinct, loudest voice, and best argument. Sadly, this
method still exists today, and in some pockets it is the predominant
means by which the organization acts. Take our advice and never, ever
work for such a company!
DATA SCIENCE IS NECESSARY...
The Business Impacts of Data Science (adapted from [7], [8] and [9])
From the Data Science perspective, this is a false choice: The siloed
approach is untenable when you consider (a) the opportunity
cost of not making maximum use of all available data to help
an organization succeed, and (b) the resource and time costs of
continuing down the same path with outdated processes. The tangible
benefits of data products include:
[Figure: Degree of effort, from low to high, across four numbered activities, each cycling through Setup, Try, Do, and Evaluate.]
Eliminating the need for silos gives us access to all the data at once –
including data from multiple outside sources. It embraces the reality
that diversity is good and complexity is okay. This mindset creates a
completely different way of thinking about data in an organization by
giving it a new and differentiated role. Data represents a significant
new profit and mission-enhancement opportunity for organizations.
Prepare
Once you have the data, you need to prepare it for analysis.
Organizations often make decisions based on inexact data. Data
stovepipes mean that organizations may have blind spots. They are
not able to see the whole picture and fail to look at their data and
challenges holistically. The end result is that valuable information is
withheld from decision makers. Research has shown almost 33% of
decisions are made without good data or information. [10]
When Data Scientists are able to explore and analyze all the data, new
opportunities arise for analysis and data-driven decision making. The
insights gained from these new opportunities will significantly change
the course of action and decisions within an organization. Gaining
access to an organization’s complete repository of data, however,
requires preparation.
Our experience shows time and time again that the best tool for
Data Scientists to prepare for analysis is a lake – specifically, the Data
Lake.[11] This is a new approach to collecting, storing and integrating
data that helps organizations maximize the utility of their data.
Instead of storing information in discrete data structures, the Data
Lake consolidates an organization’s complete repository of data in
a single, large view. It eliminates the expensive and cumbersome
data-preparation process, known as Extract/Transform/Load (ETL),
necessary with data silos. The entire body of information in the Data
Lake is available for every inquiry – and all at once.
Analyze
The Analyze activity requires the greatest effort of all the activities
in a Data Science endeavor. The Data Scientist actually builds the
analytics that create value from data. Analytics in this context is
an iterative application of specialized and scalable computational
resources and tools to provide relevant insights from exponentially
growing data. This type of analysis enables real-time understanding
of risks and opportunities by evaluating situational, operational and
behavioral data.
Data Scientists work across the spectrum of analytic goals – Describe,
Discover, Predict and Advise. The maturity of an analytic capability
determines the analytic goals encompassed. Many variables play key
roles in determining the difficulty and suitability of each goal for an
organization. Some of these variables are the size and budget of an
organization and the type of data products needed by the decision
makers. A detailed discussion on analytic maturity can be found in
Data Science Maturity within an Organization.
Act
Now that we have analyzed the data, it’s time to take action.
[Figure: The Data Science Maturity Model – proportion of effort across the Stages of Maturity, from Data Silos through Collect, Describe, Discover, Predict, and Advise.]
The maturity model provides a powerful tool for understanding and appreciating the maturity of a Data Science capability. Organizations need not reach maximum maturity to achieve success. Significant gains can be found in every stage. We believe strongly that one does not engage in a Data Science effort, however, unless it is intended to produce an output – that is, you have the intent to Advise. This means simply that each step forward in maturity drives you to the right in the model diagram. Moving to the right requires the correct processes, people, culture and operating model – a robust Data Science capability. What Does it Take to Create a Data Science Capability? addresses this topic.

We have observed very few organizations actually operating at the highest levels of maturity, the Predict and Advise stages. The tradecraft of Discover is only now maturing to the point that organizations can focus on advanced Predict and Advise activities. This is the new frontier of Data Science. This is the space in which we will begin to understand how to close the cognitive gap between humans and computers. Organizations that reach Advise will be met with true insights and real competitive advantage.

»» Where does your organization fall in analytic maturity?

Take the quiz!

1. How many data sources do you collect?
a. Why do we need a bunch of data? – 0 points, end here.
b. I don't know the exact number. – 5 points
c. We identified the required data and collect it. – 10 points

2. Do you know what questions your Data Science team is trying to answer?
a. Why do we need questions? – 0 points
b. No, they figure it out for themselves. – 5 points
c. Yes, we evaluated the questions that will have the largest impact to the business. – 10 points
Building Your Data Science Team
A critical component to any Data Science capability is having the
right team. Data Science depends on a diverse set of skills as shown
in The Data Science Venn Diagram. Computers provide the
environment in which data-driven hypotheses are tested, and as such,
computer science is necessary for data manipulation and processing.
Mathematics provides the theoretical structure in which Data Science
problems are examined. A rich background in statistics, geometry,
linear algebra, and calculus is important for understanding the basis
of many algorithms and tools. Finally, domain expertise contributes
to an understanding of what problems actually need to be solved,
what kind of data exists in the domain, and how the problem space
may be instrumented and measured.
[Figure: The Data Science Venn Diagram. Domain Expertise provides understanding of the reality in which a problem space exists.]
Remember that Data Science is a team sport. Most of the time, you
will not be able to find the rare “unicorns” - people with expertise
across all three of the skill areas. Therefore, it is important to build a
blended team that covers all three elements of the Data Science
Venn Diagram.
BALANCING THE DATA SCIENCE TEAM EQUATION
2 CSM₂ + 2 CS + MDE → CS₄M₅DE
Understanding What Makes
a Data Scientist
Data Science often requires a significant investment of time across a variety of tasks. Hypotheses must be generated and data must be acquired, prepared, analyzed, and acted upon. Multiple techniques are often applied before one yields interesting results. If that seems daunting, it is because it is. Data Science is difficult, intellectually taxing work, which requires lots of talent: both tangible technical skills as well as the intangible "x-factors."

There are four independent yet comprehensive foundational Data Science competency clusters that, when considered together, convey the essence of what it means to be a successful Data Scientist. There are also reach back competencies that complement the foundational clusters but do not define the core tradecraft or attributes of the Data Science team.

»» The Triple Threat Unicorn
Individuals who are great at all three of the Data Science foundational technical skills are like unicorns – very rare and if you're ever lucky enough to find one they should be treated carefully. When you manage these people:
›› Put extra effort into managing their careers and interests within your organization. Build opportunities for promotion into your organization that allow them to focus on mentoring other Data Scientists and progressing the state of the art while also advancing their careers.
›› Make sure that they have the opportunity to present and spread their ideas in many different forums, but also be sensitive to their time.

Technical: "Knows How and What to do"
Competencies: Advanced Mathematics; Computer Science; Data Mining and Integration; Database Science; Research Design; Statistical Modeling; Machine Learning; Operations Research; Programming and Scripting
The technical competency cluster depicts the foundational technical and specialty knowledge and skills needed for successful performance in each job or role.

Data Science Consulting: "Can Do in a Client and Customer Environment"
Competencies: Collaboration and Teamwork; Communications; Data Science Consulting; Ethics and Integrity
The characteristics in the consulting competency cluster can help Data Scientists easily integrate into various market or domain contexts and partner with business units to understand the environment and solve complex problems.

Cognitive: "Able to Do or Learn to Do"
Competencies: Critical Thinking; Inductive and Deductive Reasoning; Problem Solving
The cognitive competency cluster represents the type of critical thinking and reasoning abilities (both inductive and deductive) a Data Scientist should have to perform their job.
Shaping the Culture
It is no surprise—building a culture is hard and there is just as
much art to it as there is science. It is about deliberately creating the
conditions for Data Science to flourish (for both Data Scientists and
the average employee). You can then step back to empower collective
ownership of an organic transformation.
Centralized Data Science teams serve the organization across all business
units. The team is centralized under a Chief Data Scientist and they all
co-locate together. The domain experts come to this organization for
brief rotational stints to solve challenges around the business. This model
provides greater efficiency with limited Data Science resources but can also
create the perceived need to compete with other business units for Data
Science talent. To address this challenge, it is important to place emphasis
on portfolio management and creating transparency on how organizations
will identify and select Data Science projects.
Deployed Data Science teams go to the business unit and reside there for
short- or long-term assignments. They are their own entity and they work
with the domain experts within the group to solve hard problems. In
the deployed model, Data Science teams collectively develop knowledge
across business units, with central leadership as a bridging mechanism for
addressing organization-wide issues. However, Data Science teams are
accountable to business unit leadership and their centralized leadership,
which could cause confusion and conflict. In this model, it is important
to emphasize conflict management to avoid competing priorities.
The Diffused Data Science team is one that is fully embedded with each
group and becomes part of the long-term organization. These teams work
best when the nature of the domain or business unit is already one focused
on analytics. In the Diffused Model, teams can quickly react to high-
priority business unit needs. However, the lack of central management can
result in duplicate software and tools. Additionally, business units with the
most money will often have full access to analytics while other units have
none—this may not translate to the greatest organizational impact. In this
model, it is important to establish cross-functional groups that promote
organization-wide governance and peer collaboration.
Full descriptions of each operating model can be found in Booz Allen’s Tips for
Building a Data Science Capability [13].
How to Generate Momentum
A Data Science effort can start at the grass roots level by a few folks
tackling hard problems, or as directed by the Chief Executive Officer,
Chief Data Officer, or Chief Analytics Officer. Regardless of how an
effort starts, political headwinds often present more of a challenge
than solving any technical hurdles. To help battle the headwinds, it is
important to generate momentum and prove the value a Data Science
team can provide. The best way to achieve this is usually through
a Data Science prototype or proof of concept. Proofs of concept
can generate the critical momentum needed to jump start any Data
Science capability. Four qualities, in particular, are essential for every
Data Science prototype:
If the first thing you try to do is to create the ultimate solution, you will fail, but only after banging your head against a wall for several weeks.

›› Complicated does not equal better. As technical practitioners, we have a tendency to explore highly complex, advanced approaches. While there are times where this is necessary, a simpler approach can often provide the same insight. Simpler means easier and faster to prototype, implement and verify.
The Importance of Reason
Reason and common sense are foundational to Data Science. Without these, data is
simply a collection of bits. Context, inferences and models are created by humans and
carry with them biases and assumptions. Blindly trusting your analyses is a dangerous
thing that can lead to erroneous conclusions. When you approach an analytic
challenge, you should always pause to ask yourself the following questions:
[Figure: Analytic classes and their characteristics – transforming analytics, learning analytics, and predictive analytics; data types (structured and unstructured data); learning styles (supervised and unsupervised learning, online and offline learning); and execution models (batch and streaming execution, serial and parallel execution).]
»» Transforming Analytics
›› Aggregation: Techniques to summarize the data. These
include basic statistics (e.g., mean, standard deviation),
distribution fitting, and graphical plotting.
›› Enrichment: Techniques for adding additional information
to the data, such as source information or other labels.
›› Processing: Techniques that address data cleaning,
preparation, and separation. This group also includes
common algorithm pre-processing activities such as
transformations and feature extraction.
»» Learning Analytics
›› Regression: Techniques for estimating relationships among
variables, including understanding which variables are
important in predicting future values.
›› Clustering: Techniques to segment the data into naturally
similar groups.
›› Classification: Techniques to identify data element
group membership.
›› Recommendation: Techniques to predict the rating or
preference for a new entity, based on historic preference
or behavior.
»» Predictive Analytics
›› Simulation: Techniques to imitate the operation of a real-
world process or system. These are useful for predicting
behavior under new conditions.
›› Optimization: Operations Research techniques focused on
selecting the best element from a set of available alternatives
to maximize a utility function.
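As a minimal illustration of the Optimization class, the sketch below uses SciPy's linear programming solver to pick the best production mix under resource constraints. The product quantities, profits, and resource limits are invented for the example and are not drawn from the text.

```python
# A minimal optimization sketch (hypothetical data): choose production
# quantities x1, x2 to maximize profit subject to resource constraints.
from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2  ->  minimize the negated objective.
profit = [-40, -30]

# Resource constraints: machine hours and raw material (invented numbers).
A_ub = [[2, 1],   # machine hours used per unit
        [1, 3]]   # raw material used per unit
b_ub = [100, 90]  # available machine hours, available raw material

result = linprog(profit, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(0, None), (0, None)], method="highs")

print("Optimal production mix:", result.x)
print("Maximum profit:", -result.fun)
```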
Learning Models
Analytic classes that perform predictions, such as regression,
clustering, classification and recommendation employ learning
models. These models characterize how the analytic is trained to
perform judgments on new data based on historic observation.
Aspects of learning models describe both the types of judgments
performed and how the models evolve over time, as shown in the
figure, Analytic Learning Models.
[Figure: Analytic Learning Models – learning style (Unsupervised, Semi-Supervised, Supervised) and training style (Offline, Reinforcement, Online).]
Execution Models

Execution models describe how data is manipulated to perform an analytic function. They may be categorized across a number of dimensions. Execution Models are embodied by an execution framework, which orchestrates the sequencing of analytic computation. In this sense, a framework might be as simple as a programming language runtime, such as the Python interpreter, or a distributed computing framework that provides a specific API for one or more programming languages such as Hadoop, MapReduce or Spark. Grouping execution models based on how they handle data is common, classifying them as either batch or streaming execution models. The categories of execution model are shown in the figure, Analytic Execution Models.

[Figure: Analytic Execution Models – scheduling (Batch, Streaming) and sequencing (Serial, Parallel).]
Batch analytics operate on discrete blocks of data; they represent discrete units of work. As such, it is easy to identify
a specific series of execution steps as well as the proper execution
frequency and time bounds based on the rate at which data arrives.
Depending on the algorithm choice, batch execution models are
easily scalable through parallelism. There are a number of frameworks
that support parallel batch analytic execution. Most famously,
Hadoop provides a distributed batch execution model in its
MapReduce framework.
Batch and streaming execution models are not the only dimensions
within which to categorize analytic execution methods. Another
distinction is drawn when thinking about scalability. In many cases,
scale can be achieved by spreading computation over a number of
computers. In this context, certain algorithms require a large shared
memory state, while others are easily parallelizable in a context
where no shared state exists between machines. This distinction has
significant impacts on both software and hardware selection when
building out a parallel analytic execution environment.
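To make the batch versus streaming distinction concrete, here is a small illustrative sketch (not from the text) that computes a running mean both ways: the batch version sees all of the data at once, while the streaming version updates its state one record at a time.

```python
# Illustrative sketch: the same statistic computed under a batch
# execution model and under a streaming execution model.

def batch_mean(values):
    """Batch: the full dataset is available as one discrete unit of work."""
    return sum(values) / len(values)

class StreamingMean:
    """Streaming: state is updated incrementally as each record arrives."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # current estimate after this record

data = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
print("batch mean:", batch_mean(data))

stream = StreamingMean()
for x in data:              # records arrive one at a time
    current = stream.update(x)
print("streaming mean:", current)
```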
Iterative by Nature
Good Data Science is fractal in time — an iterative process. Getting
an imperfect solution out the door quickly will gain more interest
from stakeholders than a perfect solution that is never completed. The
figure, The Data Science Product Lifecycle, summarizes the lifecycle of
the Data Science product.
[Figure: The Data Science Product Lifecycle – Setup, Try, Do, Evaluate.]
[Figure: The Fractal Analytic Model – GOAL, DATA (Text, Imagery, Waveform, Geo, Time Series), COMPUTATION (e.g., Clustering, Classification), and ACTION (Productization, Data Monetization, Insights & Relationships).]
GOAL
You must first have some idea of your analytic goal and the end state
of the analysis. Is it to Discover, Describe, Predict, or Advise? It is
probably a combination of several of those. Be sure that before you
start, you define the business value of the data and how you plan to
use the insights to drive decisions, or risk ending up with interesting
but non-actionable trivia.
DATA
Data dictates the potential insights that analytics can provide. Data
Science is about finding patterns in variable data and comparing those
patterns. If the data is not representative of the universe of events you
wish to analyze, you will want to collect that data through carefully
planned variations in events or processes through A/B testing or
design of experiments. Datasets are never perfect so don’t wait for
perfect data to get started. A good Data Scientist is adept at handling
messy data with missing or erroneous values. Just make sure to spend
the time upfront to clean the data or risk generating garbage results.
COMPUTATION
ACTION
If you focus only on the science aspect of Data Science you will
never become a data artist.
Decomposing the Problem
Decomposing the problem into manageable pieces is the first step
in the analytic selection process. Achieving a desired analytic action
often requires combining multiple analytic techniques into a holistic,
end-to-end solution. Engineering the complete solution requires that
the problem be decomposed into progressively smaller sub-problems.
[Figure: Problem Decomposition Using the Fractal Analytic Model – the top-level GOAL, DATA (Text, Imagery, Waveform, Geo, Time Series), COMPUTATION (e.g., Clustering, Classification), and ACTION (Productization, Data Monetization, Insights & Relationships) decompose into nested GOAL–DATA–ACTION sub-problems.]
›› Identifying Spoofed Domains

Identifying spoofed domains is important for an organization
to preserve their brand image and to avoid eroded customer
confidence. Spoofed domains occur when a malicious actor
creates a website, URL or email address that users believe is
associated with a valid organization. When users click the link,
visit the website or receive emails, they are subjected to some
type of nefarious activity.
Stephanie Rivera

[Figure: Spoofed Domain Problem Decomposition. Goal: discover spoofed domains. Inputs: a list of recently registered company domains and a list of candidate spoofed domains. Steps: generate candidate domains and store them; describe the closeness of a spoof to valid domains by calculating a metric (a quantitative measure of feature information value) and setting a threshold that balances the false positive and false negative rates; then test & evaluation produces a quantitative threshold for automated result ranking.]
Our team was faced with the problem of identifying spoofed domains for a commercial company. On the surface, the problem sounded easy; take a recently registered domain, check to see if it is similar to the company's domain and alert when the similarity is sufficiently high. Upon decomposing the problem, however, the main computation quickly became complicated.

We needed a computation that determined similarity between two domains. As we decomposed the similarity computation, complexity and speed became a concern. As with many security-related problems, fast alert speeds are vital. Result speed created an implementation constraint that forced us to re-evaluate how we decomposed the problem.

Revisiting the decomposition process led us to a completely new approach. In the end, we derived a list of domains similar to those registered by the company. We then compared that list against a list of recently registered domains. The figure, Spoofed Domain Problem Decomposition, illustrates our approach. Upon testing and initial deployment, our analytic discovered a spoofed domain within 48 hours.
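The case study does not specify which similarity metric the team used; as one hedged illustration, the sketch below scores candidate domains against a company domain with Levenshtein (edit) distance, a common choice for spotting look-alike strings. The domain names and threshold are made up for the example.

```python
# Hypothetical sketch: flag recently registered domains that are
# suspiciously close to a legitimate domain using edit distance.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

company_domain = "examplebank.com"          # invented for illustration
recently_registered = ["examp1ebank.com", "exarnplebank.com", "weather.org"]

THRESHOLD = 2  # maximum edit distance considered "suspiciously similar"
for domain in recently_registered:
    distance = edit_distance(domain, company_domain)
    if distance <= THRESHOLD:
        print(f"ALERT: {domain} looks like {company_domain} (distance {distance})")
```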
›› ANALYTIC COMPLEXITY: Algorithmic complexity (e.g., complexity class and execution resources)
›› ACCURACY & PRECISION: The ability to produce exact versus approximate solutions as well as the ability to provide a measure of confidence
›› DATA COMPLEXITY: The data type, formal complexity measures including measures of overlap and linear separability, number of dimensions/columns, and linkages between datasets
›› DATA SIZE
Your senses are incapable of perceiving the entire universe, so we drew you a map.

[Figure: Guide to Analytic Selection – the four Data Science goals (1 DESCRIBE: How do I develop an understanding of the content of my data?, 2 DISCOVER, 3 PREDICT, 4 ADVISE), with DESCRIBE branching into FILTERING (How do I identify data based on its absolute or relative values?), IMPUTATION (How do I fill in missing values in my data?), FEATURE EXTRACTION, PROCESSING (How do I clean and separate my data?), and ENRICH (How do I add new information to my data?).]

TIP: There are several situations where dimensionality reduction may be needed:
›› Models fail to converge
›› Models produce results equivalent to random chance
›› You do not know which aspects of the data are the most important
›› You lack the computational power to perform operations across the feature space

Feature Extraction is a broad topic and is highly dependent upon the domain area. This topic could be the subject of an entire book. As a result, a detailed exploration has been omitted from this diagram. See the Feature Engineering and Feature Selection sections in the Life in the Trenches chapter for additional information.

TIP: Always check data labels for correctness. This is particularly true for time stamps, which may have reverted to system default values.

TIP: Smart enrichment can greatly speed-up computational time. It can also be a huge differentiator between the accuracy of different end-to-end analytic solutions.

Source: Booz Allen Hamilton
1 DESCRIBE – How do I develop an understanding of the content of my data?

FILTERING – How do I identify data based on its absolute or relative values?
If you want to add or remove data based on its value, start with:
› Relational algebra projection and selection
If early results are uninformative and duplicative, start with:
› Outlier removal
› Gaussian filter
› Exponential smoothing
› Median filter

IMPUTATION – How do I fill in missing values in my data? (A short sketch follows this list.)
If you want to generate values from other observations in your dataset, start with:
› Random sampling
› Markov Chain Monte Carlo (MCMC)
If you want to generate values without using other observations in your dataset, start with:
› Mean
› Statistical distributions
› Regression models

AGGREGATION – How do I collect and summarize my data?
If you are unfamiliar with the dataset, start with basic statistics:
› Count
› Mean
› Standard deviation
› Range
› Box plots
› Scatter plots
If your approach assumes the data follows a distribution, start with:
› Distribution fitting
If you want to understand all the information available on an entity, start with:
› "Baseball card" aggregation
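As a small, hedged illustration of the imputation options above, the sketch below fills missing values in an invented dataset two ways: by sampling from the observed values and by substituting the column mean.

```python
# Hypothetical imputation sketch: fill missing (None) readings either by
# random sampling from observed values or by mean substitution.
import random

readings = [12.1, None, 11.8, 12.4, None, 13.0, 12.2]  # invented data
observed = [r for r in readings if r is not None]

# Option 1: random sampling from other observations in the dataset.
random.seed(7)
sampled = [r if r is not None else random.choice(observed) for r in readings]

# Option 2: mean imputation, which uses no individual observation directly.
mean_value = sum(observed) / len(observed)
mean_filled = [r if r is not None else mean_value for r in readings]

print("sampled:", sampled)
print("mean-filled:", [round(r, 2) for r in mean_filled])
```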
CLUSTERING – How do I segment the data into naturally similar groups?
If you want an ordered set of clusters with variable precision, start with:
› Hierarchical
If you have a known number of clusters, start with:
› X-means
› Canopy
› Apriori
If you have text data, start with:
› Topic modeling

TIP: Canopy clustering is good when you only want to make a single pass over the data.
TIP: Use canopy or hierarchical clustering to estimate the number of clusters you should generate (see the sketch below).
Source: Booz Allen Hamilton
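Here is a minimal sketch of that tip, assuming scikit-learn and SciPy are available: hierarchical (Ward) clustering on synthetic data suggests a cluster count, which then seeds k-means. The data and the distance cut-off are invented for illustration.

```python
# Hedged sketch: estimate the number of clusters with hierarchical
# clustering, then run k-means with that estimate (synthetic data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs in 2-D.
X = np.vstack([rng.normal(loc=center, scale=0.5, size=(50, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

# Hierarchical clustering; cut the dendrogram at an (invented) distance.
Z = linkage(X, method="ward")
labels_hier = fcluster(Z, t=10.0, criterion="distance")
k_estimate = len(np.unique(labels_hier))
print("estimated number of clusters:", k_estimate)

# Use the estimate to seed k-means.
kmeans = KMeans(n_clusters=k_estimate, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("k-means cluster sizes:", np.bincount(labels))
```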
CLASSIFICATION – How do I predict group membership?
If you have known dependent relationships between variables, start with:
› Bayesian network
If you don't know where else to begin, start with:
› Support vector machines (SVM)
› Random forests

TIP: It can be difficult to predict which classifier will work best on your dataset. Always try multiple classifiers. Pick the one or two that work the best to refine and explore further (a sketch of this approach follows below).
TIP: These are our favorite, go-to classification algorithms.
TIP: Be careful of the "recommendation bubble", the tendency of recommenders to recommend only what has been seen in the past. You must ensure you add diversity to avoid this phenomenon.
TIP: SVD and PCA are good tools for creating better features for recommenders.
Source: Booz Allen Hamilton
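A minimal sketch of the "try multiple classifiers" tip, assuming scikit-learn and a synthetic dataset: cross-validate an SVM and a random forest side by side and keep whichever scores better.

```python
# Hedged sketch: compare a few go-to classifiers with cross-validation
# on synthetic data, then keep the best performer for further tuning.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=8, random_state=0)

candidates = {
    "SVM (RBF kernel)": SVC(kernel="rbf", C=1.0),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Mean accuracy over 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")

best = max(scores, key=scores.get)
print("Refine and explore further:", best)
```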
ADVISE – What course of action should I take?
If you have expert knowledge to capture, start with:
› Expert systems

Technique · Description · Tips From the Pros · References we love to read
Collaborative Filtering
Description: Also known as 'Recommendation,' suggest or eliminate items from a set by comparing a history of actions against items performed by users. Finds similar items based on who used them or similar users based on the items they use.
Tips From the Pros: Use Singular Value Decomposition based Recommendation for cases where there are latent factors in your domain, e.g., genres in movies.
Reference: Owen, Sean, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. New Jersey: Manning, 2012. Print.

Differential Equations
Description: Used to express relationships between functions and their derivatives, for example, change over time.
Tips From the Pros: Differential equations can be used to formalize models and make predictions. The equations themselves can be solved numerically and tested with different initial conditions to study system trajectories.
Reference: Zill, Dennis, Warren Wright, and Michael Cullen. Differential Equations with Boundary-Value Problems. Connecticut: Cengage Learning, 2012. Print.

Discrete Event Simulation
Description: Simulates a discrete sequence of events where each event occurs at a particular instant in time. The model updates its state only at points in time when events occur.
Tips From the Pros: Discrete event simulation is useful when analyzing event based processes such as production lines and service centers to determine how system level behavior changes as different process parameters change. Optimization can integrate with simulation to gain efficiencies in a process.
Reference: Burrus, C. Sidney, Ramesh A. Gopinath, Haitao Guo, Jan E. Odegard and Ivan W. Selesnick. Introduction to Wavelets and Wavelet Transforms: A Primer. New Jersey: Prentice Hall, 1998. Print.

Format Conversion
Description: Creates a standard representation of data regardless of source format. For example, extracting raw UTF-8 encoded text from binary file formats such as Microsoft Word or PDFs.
Tips From the Pros: There are a number of open source software packages that support format conversion and can interpret a wide variety of formats. One notable package is Apache Tika.
Reference: Ingersoll, Grant S., Thomas S. Morton, and Andrew L. Farris. Taming Text: How to Find, Organize, and Manipulate It. New Jersey: Manning, 2013. Print.

Fuzzy Logic
Description: Logical reasoning that allows for degrees of truth for a statement.
Tips From the Pros: Utilize when categories are not clearly defined. Concepts such as "warm", "cold", and "hot" can mean different things at different temperatures and domains.
Reference: Zadeh, L.A. "Fuzzy Sets." Information and Control. California: University of California, Berkeley, 1965. Print.

Hierarchical Clustering
Description: Connectivity based clustering approach that sequentially builds bigger (agglomerative) or smaller (divisive) clusters in the data.
Tips From the Pros: Provides views of clusters at multiple resolutions of closeness. Algorithms begin to slow for larger datasets due to most implementations exhibiting O(N³) or O(N²) complexity.
Reference: Rui Xu, and Don Wunsch. Clustering. New Jersey: Wiley-IEEE Press, 2008. Print.
Stepwise Regression
Description: A method of variable selection and prediction. Akaike's information criterion (AIC) is used as the metric for selection. The resulting predictive model is based upon ordinary least squares, or a general linear model with parameter estimation via maximum likelihood.
Tips From the Pros: Caution must be used when considering Stepwise Regression, as over fitting often occurs. To mitigate over fitting try to limit the number of free variables used.
Reference: Hocking, R.R. "The Analysis and Selection of Variables in Linear Regression." Biometrics. 1976. Print.
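The sketch below is one hedged reading of the stepwise idea: forward selection on synthetic data using statsmodels, adding at each step the variable that most improves AIC and stopping when no candidate helps. It is an illustration, not the exact procedure described in the reference.

```python
# Hedged sketch: forward stepwise regression guided by AIC (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"x{i}" for i in range(6)])
# Only x0 and x3 truly matter in this invented example.
y = 3.0 * X["x0"] - 2.0 * X["x3"] + rng.normal(scale=1.0, size=n)

def fit_aic(columns):
    """Fit OLS on the chosen columns (plus an intercept) and return AIC."""
    if columns:
        design = sm.add_constant(X[list(columns)])
    else:
        design = np.ones((n, 1))  # intercept-only model
    return sm.OLS(y, design).fit().aic

selected, remaining = [], list(X.columns)
best_aic = fit_aic(selected)
while remaining:
    trials = {col: fit_aic(selected + [col]) for col in remaining}
    best_col = min(trials, key=trials.get)
    if trials[best_col] >= best_aic:   # no candidate improves AIC; stop
        break
    best_aic = trials[best_col]
    selected.append(best_col)
    remaining.remove(best_col)

print("selected variables:", selected, "AIC:", round(best_aic, 1))
```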
LIFE in THE TRENCHES
NAVIGATING NECK DEEP IN DATA
Our Data Science experts have learned
and developed new solutions over the years
from properly framing or reframing analytic
questions. In this section, we list a few
important topics in Data Science coupled
with firsthand experience from our experts.
Going Deep into
Machine Learning
Think about where you were 10 years ago. Could computers understand
and take action based upon your spoken word? Recently, speech-to-text
quality has improved dramatically to nearly perfect accuracy, much to
the delight of many mobile phone users. In other complex tasks, similar
magic capabilities have emerged. The world-record high scores in 29
video games are now held by a machine learning algorithm with no
specific knowledge of Atari or computer games in general.
›› National Data Science Bowl
The first-ever National Data Science Bowl offered Data Scientists
a platform through which individuals could harness their passion,
unleash their curiosity and amplify their impact to affect change
on a global scale. The competition presented participants with
more than 100,000 underwater images provided by the Hatfield
Marine Science Center. Participants were challenged to develop a
classification algorithm that would enable researchers to monitor
ocean health at a speed and scale never before possible.

Aaron Sander
More than 1,000 teams submitted a total of approximately 15,000 solutions over the 90 days of the competition. A large proportion of the participants' implemented solutions used deep learning-based approaches, specifically Convolutional Neural Nets (CNNs). The competition forum exploded with competitors collectively sharing knowledge and collaborating to advance the state-of-the-art in computer vision. Participants tested new techniques for developing CNNs and contributed to the development of open source software for creating CNN models.

The top three competitors, Team Deep Sea, Happy Lantern Festival, and Poisson Process, all used CNNs in their solutions. Their results increased algorithm accuracy by 10% over the state of the art. Without their algorithms, it would have taken marine researchers more than two lifetimes to manually complete the classification process. The work submitted by all the participants represents major advances for both the marine research and Data Science communities.

»» Visit www.DataScienceBowl.com to learn more about the first-ever National Data Science Bowl

[Figure: A neural network with an input layer, hidden layer, and output layer.]
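For readers who want to see what a CNN looks like in code, here is a minimal, hedged sketch using the Keras API; the input size, layer sizes, and class count are placeholders, not the architectures the winning teams used.

```python
# Minimal CNN sketch (illustrative only; hyperparameters are placeholders).
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 121  # placeholder: one class per image category

model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu",
                  input_shape=(64, 64, 1)),      # grayscale image patches
    layers.MaxPooling2D(),                       # downsample feature maps
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),  # class probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would then call model.fit(train_images, train_labels, ...)
```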
›› Chemoinformatic Search
On one assignment, my team was confronted with the challenge of developing a search engine over chemical compounds. The goal of chemoinformatic search is to predict the properties that a molecule will exhibit as well as to provide indices over those predicted properties to facilitate data discovery in chemistry-based research. These properties may either be discrete (e.g., "a molecule treats disease x well") or continuous (e.g., "a molecule may be dissolved up to 100.21 g/ml").

Ed Kohlwey

Molecules are complex 3D structures, which are typically represented as a list of atoms joined by chemical bonds of differing lengths with varying electron domain and molecular geometries. The structures are specified by the 3-space coordinates and the electrostatic potential surface of the atoms in the molecule. Searching this data is a daunting task when one considers that naïve approaches to the problem bear significant semblance to the Graph Isomorphism Problem.[15]

The solution we developed was based on previous work in molecular fingerprinting (sometimes also called hashing or locality sensitive hashing). Fingerprinting is a dimensionality reduction technique that dramatically reduces the problem space by summarizing many features, often with relatively little regard to the importance of the feature. When an exact solution is likely to be infeasible, we often turn to heuristic approaches such as fingerprinting.

Our approach used a training set where all the measured properties of the molecules were available. We created a model of how molecular structural similarities might affect their properties. We began by finding all the sub-graphs of each molecule with length n, resulting in a representation similar to the bag-of-words approach from natural language processing. We summarized each molecule fragment in a type of fingerprint called a "Counting Bloom Filter."

Next, we used several exemplars from the set to create new features. We found the distance from each member of the full training set to each of the exemplars. We fed these features into a non-linear regression algorithm to yield a model that could be used on data that was not in the original training set. This approach can be conceptualized as a "hidden manifold," whereby a hidden surface or shape defines how a molecule will exhibit a property. We approximate this shape using a non-linear regression and a set of data with known properties. Once we have the approximate shape, we can use it to predict the properties of new molecules.

Our approach was multi-staged and complex – we generated sub-graphs, created bloom filters, calculated distance metrics and fit a linear-regression model. This example provides an illustration of how many stages may be involved in producing a sophisticated feature representation. By creatively combining and building "features on features," we were able to create new representations of data that were richer and more descriptive, yet were able to execute faster and produce better results.
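To illustrate the fingerprinting idea, here is a small, hedged sketch of a counting Bloom filter that tallies character n-grams of a toy molecular string encoding; the sizes, hash choices, and the string itself are invented for the example.

```python
# Hedged sketch: a tiny counting Bloom filter used as a fixed-length
# "fingerprint" of variable-length fragment sets (toy data).
import hashlib

class CountingBloomFilter:
    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counts = [0] * size

    def _indexes(self, item: str):
        # Derive several bucket indexes from one cryptographic hash.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item: str):
        for idx in self._indexes(item):
            self.counts[idx] += 1

    def count(self, item: str) -> int:
        # Upper bound on the item's true count (collisions can inflate it).
        return min(self.counts[idx] for idx in self._indexes(item))

def fragments(encoding: str, n: int = 3):
    """All length-n substrings, standing in for molecular sub-graphs."""
    return [encoding[i:i + n] for i in range(len(encoding) - n + 1)]

fingerprint = CountingBloomFilter()
for frag in fragments("CC(=O)OC1=CC=CC=C1C(=O)O"):  # toy string encoding
    fingerprint.add(frag)

print("estimated count of fragment 'C(=':", fingerprint.count("C(="))
```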
Models are like honored guests; you should only feed them the
good parts.
›› Cancer Cell Classification
On one project, our team was challenged to classify cancer cell profiles. The overarching goal was to classify different types of Leukemia, based on Microarray profiles from 72 samples[17] using a small set of features. We utilized a hybrid Artificial Neural Network (ANN)[18] and Genetic Algorithm[19] to identify subsets of 10 features selected from thousands.[20] We trained the ANN and tested performance using cross-fold validation. The performance measure was used as feedback into the Genetic Algorithm. When a set of features contained no useful information, the model performed poorly and a different feature set would be explored. Over time, this method selected a set of features that performed with high accuracy. The down-selected feature set increased speed and performance as well as allowed for better insight into the factors that may govern the system. This allowed our team to design a diagnostic test for only a few genetic markers instead of thousands, substantially reducing diagnostic test complexity and cost.

Paul Yacci
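The sketch below gives one hedged, heavily simplified reading of that hybrid approach using scikit-learn: a tiny genetic algorithm evolves binary feature masks, and the cross-validated accuracy of a small neural network serves as the fitness signal. The synthetic data, population sizes, and rates are all invented, and the masks may drift away from exactly ten features.

```python
# Hedged sketch: genetic-algorithm feature selection with an ANN fitness
# function (synthetic data; parameters are illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
N_FEATURES, N_SELECT = X.shape[1], 10
POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 10, 0.05

def random_mask():
    mask = np.zeros(N_FEATURES, dtype=bool)
    mask[rng.choice(N_FEATURES, size=N_SELECT, replace=False)] = True
    return mask

def fitness(mask):
    """Cross-validated accuracy of a small ANN on the selected features."""
    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                          random_state=0)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

population = [random_mask() for _ in range(POP_SIZE)]
for generation in range(GENERATIONS):
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[:POP_SIZE // 2]           # keep the fitter half
    children = []
    while len(children) < POP_SIZE - len(parents):
        a = parents[rng.integers(len(parents))]
        b = parents[rng.integers(len(parents))]
        child = np.where(rng.random(N_FEATURES) < 0.5, a, b)   # uniform crossover
        flip = rng.random(N_FEATURES) < MUTATION_RATE          # bit-flip mutation
        child = np.logical_xor(child, flip)
        if child.sum() == 0:                   # guard against an empty mask
            child = random_mask()
        children.append(child)
    population = parents + children

best = max(population, key=fitness)
print("selected feature indexes:", np.flatnonzero(best))
print("cross-validated accuracy:", round(fitness(best), 3))
```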
›› The Value of Ensemble Models
Several years ago, the Kaggle Photo Quality Prediction
competition posed the question “Given anonymized information
on thousands of photo albums, predict whether a human
evaluator would mark them as 'good'.” Participants were
supplied a large collection of user-generated photos. The goal
was to create an algorithm that could automatically pick out
particularly enjoyable or impressive photos from the collection.
Will Cukierski
Over the course of the competition, 207 people submitted entries. The log likelihood metric was
used to evaluate the accuracy of the entries. Scores for the top 50 teams ranged from 0.18434 to
0.19884, where lower is better. Kaggle data scientist Ben Hamner used the results to illustrate the
value of ensembling by means of averaging the top 50 scores. The figure below shows the results.
[Figure: Private leaderboard log likelihood (lower is better) versus final team rank, for individual entries and for the ensemble of the top-n teams.]
The blue line shows the individual scores for each of the top 50 teams. The orange line shows the ensembled score for the top n teams, where n ranges from 1 to the value on the axis. For example, the ensemble point for Final Team Rank 5 is an ensemble of the entries for teams 1 through 5. As shown in the graph, the ensembled score is lower than any single individual score. The diversity of models included within the ensemble causes the respective errors to cancel out, resulting in an overall lower score. This holds true for all points across the top 50 teams. However, after we increase the number of models in the ensemble beyond 15, we begin to see the ensembled score increase. This occurs because we are introducing less accurate (i.e., potentially overfit) models into the ensemble. The results of this simple experiment quantify the value of creating an ensemble model, while reinforcing the idea that we must be thoughtful when selecting the individual models contained within the ensemble.
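As a hedged sketch of the ensembling-by-averaging idea (not the actual Kaggle analysis), the snippet below averages the predicted probabilities of a few different models and compares log loss before and after.

```python
# Hedged sketch: average the predicted probabilities of several models
# and check whether the ensemble's log loss beats each individual model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

probabilities = []
for model in models:
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)
    probabilities.append(p)
    print(type(model).__name__, "log loss:", round(log_loss(y_test, p), 4))

# Simple ensemble: the unweighted mean of the predicted probabilities.
ensemble_p = np.mean(probabilities, axis=0)
print("Ensemble log loss:", round(log_loss(y_test, ensemble_p), 4))
```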
While most people associate data volume, velocity, and variety with
big data, there is an equally important yet often overlooked dimension
– data veracity. Data veracity refers to the overall quality and
correctness of the data. You must assess the truthfulness and accuracy
of the data as well as identify missing or incomplete information. As
the saying goes, “Garbage in, garbage out.” If your data is inaccurate or
missing information, you can’t hope to make analytic gold.
›› Time Series Modeling

On one of our projects, the team was faced with correlating the time
series for various parameters. Our initial analysis revealed that the
correlations were almost non-existent. We examined the data and
quickly discovered data veracity issues. There were missing and
null values, as well as negative-value observations, an impossibility
given the context of the measurements (see the figure, Time Series
Data Prior to Cleansing). Garbage data meant garbage results.
Brian Keller
Because sample size was already small, deleting observations was undesirable. The volatile nature of the time series meant that imputation through sampling could not be trusted to produce values in which the team would be confident. As a result, we quickly realized that the best strategy was an approach that could filter and correct the noise in the data.

We initially tried a simplistic approach in which we replaced each observation with a moving average. While this corrected some noise, including the outlier values in our moving-average computation shifted the time series. This caused undesirable distortion in the underlying signal, and we quickly abandoned the approach.

One of our team members who had experience in signal processing suggested a median filter. The median filter is a windowing technique that moves through the data point-by-point, and replaces it with the median value calculated for the current window. We experimented with various window sizes to achieve an acceptable tradeoff between smoothing noise and smoothing away signal. The figure, Time Series Data After Cleansing, shows the same two time series after median filter imputation.

The application of the median filter approach was hugely successful. Visual inspection of the time series plots reveals smoothing of the outliers without dampening the naturally occurring peaks and troughs (no signal loss). Prior to smoothing, we saw no correlation in our data, but afterwards, Spearman's Rho was ~0.5 for almost all parameters.

By addressing our data veracity issues, we were able to create analytic gold. While other approaches may also have been effective, implementation speed constraints prevented us from doing any further analysis. We achieved the success we were after and moved on to address other aspects of the problem.
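A minimal sketch of the median-filter cleanup, assuming SciPy is available; the noisy series and the window size are invented, and in practice the window would be tuned as the team describes.

```python
# Hedged sketch: clean spiky sensor readings with a sliding median filter.
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(3)
t = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(t) + rng.normal(scale=0.1, size=t.size)   # invented series
signal[[20, 75, 130]] = [8.0, -6.0, 9.0]                   # implausible spikes

# Replace each point with the median of its window (size must be odd).
cleaned = medfilt(signal, kernel_size=5)

print("max before filtering:", round(signal.max(), 2))
print("max after filtering: ", round(cleaned.max(), 2))
```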
›› Motor Vehicle Theft

On one project, our team explored how Data Science could be applied to improve public safety. According to the FBI, approximately $8 Billion is lost annually due to automobile theft. Recovery of the one million vehicles stolen every year in the U.S. is less than 60%. Dealing with these crimes represents a significant investment of law enforcement resources. We wanted to see if we could identify how to reduce auto theft while efficiently using law enforcement resources.

Armen Kherlopian

Our team began by parsing and verifying San Francisco crime data. We enriched stolen car reporting with general city data. After conducting several data experiments across both space and time, three geospatial and one temporal hotspot emerged (see figure, Geospatial and Temporal Car Theft Hotspots). The domain expert on the team was able to discern that the primary geospatial hotspot corresponded to an area surrounded by parks. The parks created an urban mountain with a number of over-foot access points that were conducive to car theft.
[Figure: Geospatial and Temporal Car Theft Hotspots – geospatial hotspots across the city, and a temporal hotspot shown by day of week (Mon–Sun) and time of day (3am–9pm).]
Source: Booz Allen Hamilton
Our team used the temporal hotspot information in tandem with the insights
from the domain expert to develop a Monte Carlo model to predict the likelihood
of a motor vehicle theft at particular city intersections. By prioritizing the
intersections identified by the model, local governments would have the
information necessary to efficiently deploy their patrols. Motor vehicle thefts
could be reduced and law enforcement resources could be more efficiently
deployed. The analysis, enabled by domain expertise, yielded actionable insights
that could make the streets safer.
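The sketch below shows one hedged way such a Monte Carlo model might look: simulate many weeks of thefts at a handful of hypothetical intersections using assumed hourly rates, then rank intersections by simulated risk. The intersections and rates are entirely invented.

```python
# Hypothetical Monte Carlo sketch: rank intersections by simulated
# likelihood of at least one theft during high-risk hours in a week.
import numpy as np

rng = np.random.default_rng(42)

# Invented expected thefts per high-risk hour at each intersection.
hourly_rates = {
    "Oak & 5th": 0.020,
    "Hill & Park": 0.035,
    "Bay & 12th": 0.010,
}
HIGH_RISK_HOURS_PER_WEEK = 3 * 7   # e.g., three evening hours per day
N_SIMULATIONS = 10_000

risk = {}
for intersection, rate in hourly_rates.items():
    # Poisson draws: thefts during the week's high-risk hours, many times over.
    weekly_thefts = rng.poisson(lam=rate * HIGH_RISK_HOURS_PER_WEEK,
                                size=N_SIMULATIONS)
    risk[intersection] = np.mean(weekly_thefts >= 1)

for intersection in sorted(risk, key=risk.get, reverse=True):
    print(f"{intersection}: P(at least one theft in a week) = {risk[intersection]:.2f}")
```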
›› Baking the Cake

I was once given a time series set of roughly 1,600 predictor variables and 16 target variables and asked to implement a number of modeling techniques to predict the target variable values. The client was challenged to handle the complexity associated with the large number of variables and needed help. Not only did I have a case of the curse, but the predictor variables were also quite diverse. At first glance, it looked like trying to bake a cake with everything in the cupboard. That is not a good way to bake or to make predictions!

Stephanie Rivera
Repeating what you just heard does not mean that you
learned anything.
A few methods where the data is split into training and testing sets
include: k-fold cross-validation, Leave-One-Out cross-validation,
bootstrap methods, and resampling methods. Leave-One-Out cross-
validation can be used to get a sense of ideal model performance
over the training set. A sample is selected from the data to act as the
testing sample and the model is trained on the rest of the data. The
error on the test sample is calculated and saved, and the sample is
returned to the dataset. A different sample is then selected and the
process is repeated. This continues until all samples in the testing set
have been used. The average error over the testing examples gives a
measure of the model’s error.
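Here is a minimal sketch of Leave-One-Out cross-validation with scikit-learn on synthetic data; the model and dataset are stand-ins chosen for the example.

```python
# Hedged sketch: Leave-One-Out cross-validation on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)

loo = LeaveOneOut()
# Each fold trains on all samples but one and tests on the held-out sample.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=loo, scoring="neg_mean_squared_error")

print("number of folds:", len(scores))          # equals the number of samples
print("average test MSE:", round(-scores.mean(), 2))
```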
»» Do we really need a case study to know that you should check your work?

There are other approaches for testing how well your hypothesis reflects the data. Statistical methods such as calculating the coefficient of determination, commonly called the R-squared value, are used to identify how much variation in the data your model explains. Note that as the dimensionality of your feature space grows, the R-squared value also grows. An adjusted R-squared value compensates for this phenomenon by including a penalty for model complexity. When testing the significance of the regression as a whole, the F-test compares the explained variance to unexplained variance. A regression result with a high F-statistic and an adjusted R-squared over 0.7 is almost surely significant.
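For reference, the adjustment is simple to compute by hand; a small sketch (values invented):

```python
# Adjusted R-squared: penalize R-squared for the number of predictors used.
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Invented example: the same R-squared looks less impressive as predictors pile up.
print(adjusted_r2(0.80, n_samples=100, n_predictors=5))    # ~0.789
print(adjusted_r2(0.80, n_samples=100, n_predictors=40))   # ~0.664
```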
PUTTING it ALL TOGETHER
›› Streamlining Medication Review
Analytic Challenge
The U.S. Food and Drug Administration (FDA) is responsible for advancing public
health by supporting the delivery of new treatments to patients; assessing the safety,
efficacy and quality of regulated products; and conducting
research to drive medical innovation. Although the FDA
houses one of the world’s largest repositories of regulatory
and scientific data, reviewers are not able to easily leverage
data-driven approaches and analytics methods to extract information, detect signals and uncover trends to enhance regulatory decision-making and protect public health. In addition, a rapid increase in the volume, velocity and variety of data that must be analyzed to address and respond to regulatory challenges, combined with variances in data standards, formats, and quality, severely limit the ability of FDA Center for Drug Evaluation and Research (CDER) regulatory scientists to conduct cross-study, cross-product, retrospective, and meta-analysis during product reviews.

Booz Allen Hamilton was engaged to research, develop, and evaluate emerging informatics tools, methods, and techniques to determine their ability to address regulatory challenges faced by the FDA Center for Drug Evaluation and Research (CDER). The main goal was to enable the CDER community to fully utilize the agency's expansive data resources for efficient and effective drug review through the design and development of informatics capabilities based on Natural Language Processing (NLP), data integration, and data visualization methodologies.

»» Our Case Studies
Hey, we have given you a lot of really good technical content. We know that this section has the look and feel of marketing material, but there is still a really good story here. Remember, storytelling comes in many forms and styles, one of which is the marketing version. You should read this chapter for what it is – great information told with a marketing voice.
Our Approach
…package inserts) with data from the FDA Adverse Event Reporting System (FAERS). Using NLP, we extracted adverse events from the product label to create a structured table of label data out of unstructured text. This dashboard allows safety evaluators to view whether or not a reported adverse event is already known, without having to access an external data source and read through product labels.

Product Quality Analytics. To support CDER's mission of reviewing and managing product quality, novel methodologies and tools are needed to improve the efficiency and efficacy of the product quality-review process. Integration of disparate data sources is the first step in building a comprehensive profile of manufacturers, facilities, and the products associated with individual facilities. To address these challenges, we developed a Facility Inventory Report to show the geographic location of facilities and their associated metadata. This geovisualization tool processes and transforms raw data into a user-friendly visual interface with mapping features to enhance the surveillance capabilities of CDER and provide reviewers with the ability to establish connections between facility data and product quality.
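As a rough illustration of turning unstructured label text into a structured record, as described above, the sketch below uses a simple dictionary-and-regex matcher. The term list and label text are invented; they are not the FDA's vocabulary or tooling, and a production system would rely on far richer NLP.

# Hedged sketch: dictionary-based extraction of adverse-event terms from label text.
import re

ADVERSE_EVENT_TERMS = ["headache", "nausea", "dizziness", "rash", "fatigue"]  # toy lexicon

def extract_adverse_events(label_text):
    """Return the known adverse-event terms that appear in a block of label text."""
    text = label_text.lower()
    return [term for term in ADVERSE_EVENT_TERMS
            if re.search(r"\b" + re.escape(term) + r"\b", text)]

label = "Adverse reactions reported in clinical trials included headache, nausea, and rash."
print(extract_adverse_events(label))                # ['headache', 'nausea', 'rash']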
Our Impact
Since the FDA is responsible for regulating 25 cents of every dollar that Americans
spend, the agency’s ability to fully use regulatory datasets and meaningfully integrate
previously incompatible data to rapidly detect product quality and safety issues is
critical for safeguarding public health. NLP approaches provide CDER with the ability to search a broader range of textual data more efficiently and enhance its ability to gain insight from additional data forms that may otherwise seem unrelated. Data integration
and visualization directly increase the efficiency of researchers by reducing their time
spent on searching for frequently-performed aggregate or granular calculations,
and by proactively presenting the most frequently desired data to the reviewer
through thoughtful and contextual dashboards designed to reveal patterns and
trends in disparate data sources. These new capabilities position the FDA to enhance
regulatory decision-making, drive advances in personalized medicine, and enable
earlier detection of safety signals in the general population.
Analytic Challenge
Domestic airline departure delays are estimated to cost the U.S. economy $32.9 billion
annually. The Federal Aviation Administration’s (FAA’s) Traffic Flow Management
System (TFMS) is used to strategically manage flights and includes a flight departure-
delay prediction engine which applies simple heuristics to predict flight delays.
However, the limited predictive power of these heuristics constrains the FAA’s ability
to act in accordance with its existing departure-delay management plan. In response,
the FAA’s NextGen Advanced Concepts and Technology Development Group wanted to
create a predictive probabilistic model to improve aircraft departure time predictions.
This new model would help the FAA understand the causes of departure delays and
develop policies and actions to improve the reliability of departure time predictions for
real-time air traffic flow management.
Our Approach
The commercial aviation industry is rich in flight operations data, much of which is publicly available through government websites and a few subscription vendors. Booz Allen Hamilton leveraged these sources to gather over 4 TB of data detailing tarmac and airspace congestion, weather conditions, network effects, Traffic Management Initiatives, and airline and aircraft-specific attributes for every commercial flight departing from U.S. airports between 2008 and 2012. This data included over 50 million flights and around 100 variables for each flight. The data included composite variables (e.g., incoming flight delay) that were constructed from the raw data to capture relevant dynamics of flight operations. Data acquisition, processing, quality control, and accuracy between disparate datasets were important steps during this process.

The team applied supervised learning algorithms to develop Bayesian Belief Network (BBN) models to predict flight departure deviation. The most critical steps in model development were the selection of optimal algorithms to discretize model variables, and the selection of appropriate machine learning techniques to learn the model from the data. The team followed information theory principles to discretize model variables to maximize the model's predictive power, and to represent the data as closely as possible with the least amount of network complexity. Booz Allen segmented the model variables into three different categories based on the time to flight departure: 24 hours, 11 hours, and one hour. Certain flight variables could only be known for specific pre-departure times. For example, the tarmac and airspace congestion variables for a flight are only known just before the flight, and hence those variables feature only in the one hour category. Departure delays were predicted for each of the three time horizons.
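For readers curious what information-theoretic discretization can look like in practice, the sketch below chooses the number of quantile bins for a continuous predictor that maximizes its mutual information with a binary delay label, minus a small complexity penalty. It assumes scikit-learn and NumPy, uses synthetic data, and illustrates only the general idea, not the model described above.

# Hedged sketch: pick a bin count that trades off mutual information against complexity.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(7)
congestion = rng.gamma(shape=2.0, scale=10.0, size=5000)        # toy congestion measure
delayed = (congestion + rng.normal(scale=10.0, size=5000) > 30).astype(int)  # toy delay label

def binning_score(n_bins, penalty=0.01):
    """Mutual information between the binned variable and the delay label,
    minus a small penalty that discourages unnecessary bins."""
    binned = KBinsDiscretizer(n_bins=n_bins, encode="ordinal",
                              strategy="quantile").fit_transform(congestion.reshape(-1, 1))
    return mutual_info_score(delayed, binned.ravel()) - penalty * n_bins

best_bins = max(range(2, 11), key=binning_score)
print("Number of bins chosen:", best_bins)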
Our Impact
For a typical airport, the model delivers a delay prediction improvement of between
100,000 and 500,000 minutes annually over previous FAA predictions. The model can
be used by a range of aviation stakeholders, such as airlines, to better understand
and predict network flight delays. This can improve the airlines’ operational decisions
to include more proactive schedule adjustments during times of disruption (e.g.
weather or sector load). The improved reliability of departure prediction will improve
FAA's predictions for airports, sectors, and other resources, and has the potential to
enable improved real-time traffic flow management, which can significantly reduce
airline departure delays and the associated economic costs. This means a more
efficient and effective air transportation network.
Analytic Challenge
The U.S. Food and Drug Administration (FDA) Center for Biologics Evaluation and
Research (CBER) is responsible for protecting public health by assuring the safety
and efficacy of biologics, including vaccines, blood and blood products. CBER’s
current surveillance process, which requires resource-intensive manual review by
expert Medical Officers, does not scale well to short-term workload variation and
limits long-term improvements in review cycle-time. In addition, the large volume of
Adverse Event (AE) reports received by the Agency makes it difficult for reviewers to
compare safety issues across products and patient populations.
CBER engaged Booz Allen Hamilton to develop advanced analytics approaches for
the triage and analysis of AE reports. The main goal was to leverage (1) Natural
Language Processing (NLP) to alleviate resource pressures by semi-automating
some of the manual review steps through techniques, such as text classification
and entity extraction, and (2) network visualizations to offer alternative interactions
with datasets and support AE pattern recognition. By integrating NLP and network
analysis capabilities into the Medical Officer’s review process, Booz Allen successfully
provided decision-makers with important information concerning product risks and
possible mitigations that can reduce risk.
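As a loose illustration of what semi-automated triage by text classification can look like, the sketch below trains a tiny TF-IDF plus logistic regression pipeline to flag reports for expedited review. The example reports and labels are invented, and the pipeline is not CBER's actual tooling.

# Hedged sketch: toy text classifier for prioritizing adverse-event reports.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "patient hospitalized with high fever after vaccination",
    "mild soreness at injection site, resolved in one day",
    "seizure reported two days after dose, emergency department visit",
    "slight fatigue reported, no medical attention required",
]
priority = [1, 0, 1, 0]                             # 1 = route for expedited review

triage_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
triage_model.fit(reports, priority)

new_report = ["patient admitted to hospital with seizure following vaccination"]
print(triage_model.predict(new_report))             # e.g., [1]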
Our Approach
…closeness, degree, strength), and interact with network nodes to gain insights into product safety issues.

Other Analytics Solutions. In addition, Booz Allen refactored, modularized, and expanded the capabilities of CBER's computer simulation model of the Bovine Spongiform Encephalopathy (BSE) agent to improve estimates of variant Creutzfeldt-Jakob disease (vCJD) risk for blood products; developed code to handle large amounts of data generated by a Markov Chain Monte Carlo analysis of the spread of influenza; developed a large database analysis strategy involving the application of classification algorithms to simulated genomic data; and implemented a Statistical Analysis Software (SAS) macro that automatically compares the relative potency of a given lot of vaccine using a matched set of dose-response curves.
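The network measures mentioned above (closeness, degree, strength) can be computed with standard graph tooling. The sketch below uses networkx on an invented product/adverse-event graph, purely for illustration.

# Hedged sketch: basic network metrics over a toy product/adverse-event graph.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("Product A", "fever", 12),
    ("Product A", "rash", 4),
    ("Product B", "fever", 7),
    ("Product B", "headache", 9),
])

print("Degree centrality:   ", nx.degree_centrality(G))
print("Closeness centrality:", nx.closeness_centrality(G))
# Node "strength": the sum of the weights on a node's edges.
strength = {node: sum(data["weight"] for _, _, data in G.edges(node, data=True)) for node in G}
print("Strength:", strength)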
Our Impact
New methods for post-market surveillance of biologics are critical for FDA reviewers who must determine whether reported adverse events are actually a result of a biologic product. With more than 10 million vaccines administered each year to children less than one year old, CBER reviewers are under pressure to quickly
evaluate potential safety signals through manual evaluation of AE reports, review
of scientific literature, and analysis of cumulative data using frequency calculations
or statistical algorithms. Booz Allen’s support resulted in the development
of innovative and data-driven approaches for the analysis of structured and
unstructured AE reports. We increased the speed of existing text mining tools by
two thousand times, allowing CBER reviewers to run a text mining algorithm to extract information contained in Vaccine Adverse Event Reporting System (VAERS) reports in seconds instead of hours. We
also increased the productivity of Medical Officers through the implementation
of text mining and network analysis tools. These new capabilities allow CBER to
streamline the post-market review process, extract knowledge from scientific data,
and address public concerns regarding vaccine safety more quickly and efficiently.
Analytic Challenge
Mass atrocities are rare yet devastating crimes. They are also preventable. Studies
of past atrocities show that we can detect early warning signs of atrocities and that
if policy makers act on those warnings and develop preventive strategies, we can
save lives. Yet despite this awareness, all too often we see warning signs missed and
action taken too late, if at all, in response to threats of mass atrocities.
The Early Warning Project, an initiative of the United States Holocaust Memorial
Museum (Holocaust Museum), aims to assess a country’s level of risk for the onset
of future mass killings. Over time, the hope is to learn which models and which
indicators are the best at helping anticipate future atrocities to aid in the design and
implementation of more targeted and effective preventive strategies. By seeking to
understand why and how each country's relative level of risk rises and falls over
time, the system will deepen understanding of where new policies and resources can
help make a difference in averting atrocities and what strategies are most effective.
This will arm governments, advocacy groups, and at-risk societies with earlier and
more reliable warning, and thus more opportunity to take action, well before mass
killings occur.
The project’s statistical risk assessment seeks to build statistical and machine
learning algorithms to predict the onset of a mass killing in the succeeding 12 months
for each country with a population larger than 500,000. The publicly available system aggregates and provides access to open source datasets and democratizes the
source code for analytic approaches developed by the Holocaust Museum staff and
consultants, the research community, and the general public. The Holocaust Museum
engaged Booz Allen to validate existing approaches as well as explore new and
innovative approaches for the statistical risk assessment.
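As a simplified illustration of a statistical risk assessment of this kind, the sketch below fits a logistic regression over synthetic country-year features to estimate the probability of onset in the following period. The features, data, and model are invented and are not the Early Warning Project's actual indicators or code.

# Hedged sketch: toy country-year risk model with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 400                                             # toy country-year observations
X = np.column_stack([
    rng.normal(size=n),                             # e.g., an instability index
    rng.normal(size=n),                             # e.g., a prior-conflict indicator
    rng.normal(size=n),                             # e.g., an economic-shock measure
])
onset = (X @ np.array([1.2, 0.8, 0.5]) + rng.normal(size=n) > 1.5).astype(int)

model = LogisticRegression().fit(X, onset)
risk = model.predict_proba(X)[:, 1]                 # estimated probability of onset
print("Five highest-risk observations:", np.argsort(risk)[-5:])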
Our Approach
Taking into account the power of crowdsourcing, Booz Allen put out a call to employees to participate in a hack-a-thon, just the start of the team's support as the Museum refined and implemented the recommendations. More than 80 Booz Allen Hamilton software engineers, data analysts, and social scientists devoted a Saturday to participate. Interdisciplinary teams spent 12 hours identifying new datasets, building new machine learning models, and creating frameworks for ensemble modeling and interactive results visualization.

Following the hack-a-thon, Booz Allen Data Scientists worked with Holocaust Museum staff to create a data management framework to automate the download, aggregation, and transformation of the open source datasets used by the statistical assessment. This extensible framework allows integration of new datasets with minimal effort, thereby supporting greater engagement by the Data Science community.
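A download-aggregate-transform framework of this kind can be quite small at its core. The sketch below shows the general shape in pandas; the source URLs and column names are hypothetical placeholders, not the project's actual datasets.

# Hedged sketch: automated download and merge of open source datasets.
import pandas as pd

SOURCES = {
    "governance": "https://example.org/governance_indicators.csv",   # hypothetical URL
    "conflict": "https://example.org/conflict_events.csv",           # hypothetical URL
}

def refresh_datasets():
    """Download each source and merge them on (country, year) into one analysis table."""
    frames = [pd.read_csv(url) for url in SOURCES.values()]
    merged = frames[0]
    for frame in frames[1:]:
        merged = merged.merge(frame, on=["country", "year"], how="outer")
    return merged

# merged_table = refresh_datasets()                 # run on a weekly schedule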
Our Impact
Publicly launched in the fall of 2015, the Early Warning Project can now leverage
advanced quantitative and qualitative analyses to provide governments, advocacy
groups and at-risk societies with assessments regarding the potential for mass
atrocities around the world. Integration of the project’s statistical risk assessment
models and expert opinion pool created a publicly available source of invaluable
information and positioned Data Science at the center of global diplomacy.
The data management framework developed from the lessons learned during the hack-a-thon represents a great leap forward for the Holocaust Museum. The cycle of aggregating and transforming data was shortened from twice per year to once
per week. In addition to providing the community with more up-to-date data, the
reduced burden on researchers enables them to spend more time analyzing data and
identifying new and emergent trends. The extensible framework will also allow the
Holocaust Museum to seamlessly integrate new datasets as they become available or
are identified by the community as holding analytic value for the problem at hand.
Through this project, the Holocaust Museum was able to shift the dynamic from
monitoring ongoing violence to determining where it is likely to occur 12 to 24 months
into the future by integrating advanced quantitative and qualitative analyses to assess
the potential for mass atrocities around the world. The Early Warning Project is an
invaluable predictive resource supporting the global diplomatic dialogue. While the
focus of this effort was on the machine learning and data management technologies
behind the initiative, it demonstrates the growing role the Data Science community is
playing at the center of global diplomatic discussions.
Analytic Challenge
In the past, conventional statistics have worked well for analyzing the impact of
direct marketing promotions on purchase behavior. Today, modern multi-channel
promotions often result in datasets that are highly dimensional and sometimes
sparse, which strains the power of conventional statistical methods to accurately
estimate the effect of a promotion on individual purchase decisions. Because of
the growing frequency of multi-channel promotions, IHG (InterContinental Hotels Group) was driven to investigate
new approaches. In particular, IHG and Booz Allen studied one recent promotional
campaign using hotel, stay, and guest data for a group of loyalty program customers.
Our Approach
Working closely with IHG experts, Booz Allen investigated three key elements related
to different stages of analytic maturity:
Describe: Using initial data mining, what insights or tendencies in guest behaviors can be identified after joining multiple, disparate datasets?

Discover: Can we determine which control group members would be likely to register for a promotion if offered? If so, can we also quantify their registration?

Predict: How would a hotel guest that received the promotion have responded if they were not offered the promotion? How would a hotel guest that did not receive the promotion have responded if they were offered the promotion?

For the promotion that was the focus of this case study, not everything about customers could be controlled as required by traditional statistics. However, because a probabilistic Bayesian Belief Network (BBN) can learn the pairwise relationships between all individual customer attributes and their impact on promotional return, Booz Allen investigated how this technique could be used to model each treated customer without an exact controlled look-alike.

Specifically, Booz Allen developed a BBN to predict customer-by-customer impacts driven by promotional campaign offers, subsequently estimating the aggregated ROI of individual campaigns. We used six machine learning techniques (support vector machine, random forest, decision tree, neural network, linear model, and AdaBoost) in unison with the BBN to predict how each customer would be influenced by a promotional offer.
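As a hedged sketch of combining the six model families named above, the snippet below builds a soft-voting ensemble in scikit-learn over a toy promotion-response dataset. It omits the BBN step and is not the actual model used in this engagement.

# Hedged sketch: soft-voting ensemble over the six model families named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)   # toy guest data

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier()),
        ("tree", DecisionTreeClassifier()),
        ("nn", MLPClassifier(max_iter=1000)),
        ("linear", LogisticRegression(max_iter=1000)),
        ("ada", AdaBoostClassifier()),
    ],
    voting="soft",                                  # average the predicted probabilities
)
ensemble.fit(X, y)
print("Estimated response probability, first guest:", ensemble.predict_proba(X[:1])[0, 1])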
Our Impact
Because Booz Allen and IHG’s approach enabled estimation of ROI for each
hypothetical customer, even when no exact look-alikes exist, there are a number
of valuable future applications. One such application is optimal campaign design: the ability to estimate the promotional attributes for an individual customer that are likely to drive the greatest incremental spend. Another application is efficient audience selection, which would reduce the risk of marketing "spam" that prompts costly unsubscriptions and can negatively impact a hotel's brand.
THE FUTURE of DATA SCIENCE: Algorithms and Applications

›› IoT, including sensor analytics, smart data, and emergent discovery alerting and response
›› Customer Engagement and Experience, including 360-degree view, gamification, and just-in-time personalization
›› Smart X, where X = cities, highways, cars, delivery systems, supply chain, and more
›› Precision Y, where Y = medicine, farming, harvesting, manufacturing, pricing, and more
›› Personalized Z, where Z = marketing, advertising, healthcare, learning, and more
›› Human capital (talent) and organizational analytics
›› Societal good
Thank you for taking this journey with us. Please join our
conversation and let your voice be heard. Email us your ideas
and perspectives at [email protected] or submit them
via a pull request on the GitHub repository.
Tell us and the world what you know. Join us. Become an
author of this story.