Lineate Data Rich Dashboards Ebook
How Data-Rich Dashboards Have Changed The Game
Addressing complexity in the data-rich dashboards of today.
CONTENTS
What is a data-rich dashboard?
The twin concerns of exploring data in real time
Challenges of querying event-level data
Managing complexity using drill paths
Optimizing for time and space
Conclusion
Introduction
The amount of data collected by organizations is exploding, growing at an expected 26%
compound annual growth rate between 2015 and 2025. The emerging influx of data comprises
event-level information from various applications, user devices, sensors, and low-level
operational processes. It has the potential to open doors to new ways to grow business,
ranging from personalization, predictive optimization, and tailored product recommendations
to countless opportunities for direct relationships with customers.
A crucial step towards this is to aggregate and visualize the data in ways that assist in
developing these processes, be they data science hypotheses, optimization paths, or product
ideas. But such aggregation and visualization is difficult with large amounts of
semi-structured data.
It requires that we prepare the data such that we can explore it almost instantaneously,
and that we preplan the interactions from both user experience and data optimization
perspectives.
What is a data-rich dashboard?
A dashboard is distinct from a report in that it provides multiple views into the data and
allows a user to explore it interactively. Traditionally, dashboards were a set of high-level
metrics organized to give groups of users what they need at a glance. The data-rich
dashboard takes this same idea and incorporates low-level, semi-structured events into it.
In the past, these events were often not captured at all, or at best were dumped into a data
lake for occasional ad-hoc queries, or summarized as high-level aggregates and exposed as
KPIs. A data-rich dashboard exposes these low-level events at scale. It takes granular data
points as they flow in, and provides an immersive and extremely responsive way to make
sense of them.
[Figure: Dashboards]
When we talk about making dashboards data-rich, we’re focusing on operational and analytic
dashboards. Both use cases offer the potential for people to quickly and seamlessly
explore data down to extremely granular events.
For operations, this can mean providing point-in-time and near real-time snapshots of
operations to adjust or optimize parameters on the fly (e.g. retargeting an advertising
campaign to where it is performing best).
For analytics, it can mean providing visualization tools for your data scientists to assist in
hypothesis generation, or it can give other users an understanding of why algorithms
are doing what they are, and help them make sense of the flood of aggregated information.
[Figure: Data-rich dashboards]
The twin concerns of exploring data in real time
For both Operational and Analytic dashboards, it is imperative that users be able to interact
with the dashboard and get an immediate response. Load delays or clunky interactions will
quickly render using the system a chore, defeating the purpose of making it easy to identify
trends to inform decisions. The central challenge in building them involves how low-level
event data can be presented intuitively and quickly. We do this through preplanning how
data is going to be accessed, and optimizing data retrieval paths along those dimensions.
Before we get into how we do that, we’ll state the two rules of thumb we would want for any
data-rich dashboard.
2 ‘Rules of Thumb’

#1 User interactions should have immediate results.
Every action a user takes should result in the dashboard fully updated within 500
milliseconds. Since any attempt to drill down or query change can result in reloading
multiple components across the dashboard, these queries need to be able to run in parallel
and complete within a specific time bound. The event data will not naturally fit to within
these constraints, so we must preprocess and preaggregate so that we can succeed.

#2 Exploration should be natural and intuitive.
The dashboard must be optimized and designed around the activities that users expect to
occur without interruption. For example, when rendering geospatial data, we assume the user
will start scrolling through the maps, and the surrounding regions must be ready to be
displayed as she does so. For event data, we expect the user to refine the time window she
is looking at and will only know the right granularity when it appears. Since it is
impossible to resolve completely arbitrary activities in near real-time, we need to develop
against the use cases that matter.

In short, we must have the right data, in the right place, ready for the user’s next move.
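
To make the first rule concrete, here is a minimal sketch of refreshing several widgets in parallel under a single 500 millisecond budget. It assumes Python's asyncio; the `fetch_widget` function, the widget names, and the simulated delays are hypothetical stand-ins for real queries against preaggregated data, not part of any particular product.

```python
import asyncio
import time

LATENCY_BUDGET_S = 0.5  # the 500 ms target from rule #1


async def fetch_widget(name: str, delay_s: float) -> tuple[str, str]:
    """Hypothetical stand-in for one widget's query against a preaggregated index."""
    await asyncio.sleep(delay_s)  # simulates query latency
    return name, f"{name}: ok"


async def refresh_dashboard(widgets: dict[str, float]) -> dict[str, str]:
    """Run every widget query concurrently under one shared deadline."""
    tasks = {
        name: asyncio.create_task(fetch_widget(name, delay))
        for name, delay in widgets.items()
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=LATENCY_BUDGET_S)
    for task in pending:
        task.cancel()  # a slow query must not hold up the rest of the page
    return {
        name: task.result()[1] if task in done else "timed out"
        for name, task in tasks.items()
    }


def timed_refresh(widgets: dict[str, float]):
    """Refresh the dashboard and report how long the whole refresh took."""
    start = time.perf_counter()
    results = asyncio.run(refresh_dashboard(widgets))
    return results, time.perf_counter() - start
```

Calling `timed_refresh({"map": 0.01, "table": 0.02, "slow": 2.0})` returns results for the two fast widgets and marks the slow one as timed out, with the total time bounded by the budget and not by the slowest query. Measuring elapsed time like this is one simple way to turn the latency target into a testable requirement.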
Challenges of querying event-level data
As we incorporate granular data into the dashboards, the data challenges around these
become considerable. An old-school dashboard might report on the level of individual sales
or conversions, which would probably involve handling up to somewhere on the order of
thousands of records per day. When we start to include lower-level user events (individual
device actions, operational logs, ad campaign impression-level data, etc.), it’s not unusual to
be dealing with millions or even billions per day. Even when the amount is lower, this kind of
data is typically less structured than traditional operational data, and often not conducive to
querying. Since we gave ourselves a mandate of responding immediately, we’ll need to carefully
design our data architecture to make this a possibility.
There are powerful database products like Snowflake or Vertica which are designed specifically
to handle analytical queries across large data volumes. These are often crucial components
in a data warehouse, and offer the ability to create and run complex queries over large
amounts of data.
They are not, however, designed to run large numbers of queries within the half-second
latency requirement we set out for ourselves. This isn’t because of any flaw in the products:
they are general-purpose analytical databases designed for flexibility. It is simply
not possible to create a general-purpose database that can run any kind of query over a
large dataset in a fixed amount of time. Fortunately, a real-time dashboard is not attempting
to answer arbitrary questions, so there are other ways to solve this problem.
To achieve this, we must design the data interactions in the planning stages. We carefully
consider how users are likely to interact with the data and what kinds of questions they will
be asking. By designing the dashboard around such use cases, we can achieve enormous
improvements in performance and scale through careful preaggregation of data. We then use
appropriate databases to create the custom preaggregated indexes we need for the use cases
we’ve designed.

In the case of event data, almost all interactions begin with querying over a single time
range. So we start designing our data model through period-based preaggregations of useful
partial values, and ensure we have a database that can query across time ranges efficiently
(time series databases such as Apache Druid and InfluxDB are often good choices, as are
columnar databases such as ClickHouse or wide-column stores such as Cassandra, depending on
the data and use cases). We know the user is going to select a time range, so we preindex the
events into individual periods within that range. We then take those preaggregated events and
combine them in ways that depend on the action the user takes.

These aggregates must be kept updated with events as they flow in. The write load is by
definition intensive, and the performance of the data indexes needs to take that into account.
Compounding this is the fact that event data is presumably coming from multiple disconnected
systems operating at different network tiers and connecting to different devices. The nature
of distributed systems means that events are unlikely to come in sequentially, so streaming
the data into the aggregate indexes needs to be able to rewind and recover without impacting
dashboard performance. These are solvable problems, but they need to be considered at the time
the data architecture is designed.

Querying large amounts of data in parallel is fundamental to a data-rich dashboard. To do
so effectively doesn’t require that everything be known up front, but it does necessitate that
data efficiency is built into the development process from the start. The data needs to be
preaggregated in the ways it will be queried at scale, and it needs to be streamed into this
system in a way that minimizes impact on user-facing systems and the users of the dashboards
themselves. Latency targets need to be built into the development process as first-class
requirements, with code-level timing assertions built in throughout the engineering and
testing phases. The return on this is a system that allows users to work with and explore
potentially huge pools of data as if it were all immediately available.
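
As a rough illustration of period-based preaggregation, the sketch below keeps partial values (a count and a sum) per hourly bucket and combines them at query time. The hourly period, the `HourlyAggregates` class, and the in-memory dictionary are assumptions made for the example; a real system would hold these buckets in a database suited to time ranges. Because the partials are plain additions, an event that arrives late simply lands in its own bucket. The sketch does not cover deduplication when a stream is rewound and replayed.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def bucket_start(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hourly bucket."""
    return ts.replace(minute=0, second=0, microsecond=0)


class HourlyAggregates:
    """Keeps partial values (count, sum) per hour so averages can be derived later."""

    def __init__(self) -> None:
        # bucket start -> [count, sum]
        self._buckets = defaultdict(lambda: [0, 0.0])

    def ingest(self, ts: datetime, value: float) -> None:
        # Addition is commutative, so out-of-order events need no special handling.
        partial = self._buckets[bucket_start(ts)]
        partial[0] += 1
        partial[1] += value

    def query(self, start: datetime, end: datetime) -> dict:
        """Combine the partials of every bucket that starts in [start, end)."""
        count, total = 0, 0.0
        for bucket, (c, s) in self._buckets.items():
            if start <= bucket < end:
                count += c
                total += s
        return {"count": count, "sum": total, "avg": total / count if count else None}


# Usage: the third event arrives after a later one but still lands in the 09:00 bucket.
t0 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
agg = HourlyAggregates()
agg.ingest(t0 + timedelta(minutes=5), 10.0)
agg.ingest(t0 + timedelta(hours=1, minutes=30), 30.0)
agg.ingest(t0 + timedelta(minutes=50), 20.0)
first_hour = agg.query(t0, t0 + timedelta(hours=1))
both_hours = agg.query(t0, t0 + timedelta(hours=2))
```

Storing the count and the sum in place of a precomputed average is what allows buckets to be merged correctly across any selected time range.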
Managing complexity using drill paths
We just discussed that we need to properly preaggregate the data if we want to make a
dashboard immediately responsive. But there is a problem here: the more dimensions we allow
the user to explore, the greater the burden of preindexing data becomes.
For each dimension we allow the user to explore, we need to maintain the set of preaggregates
that allow that data to be returned quickly into the dashboard. In addition to that,
since the dashboard allows users to drill down by various attributes as they use the
dashboard, it means that we need preindexed data not simply for every dimension we are
exposing, but for every combination of attributes they may invoke.
This can result in a “combinatorial explosion” of preprocessing and disk space as the number
of supported queries increases. Therefore, we need to control this at the application design
level. This is where the concept of “drill paths” comes in.
By defining a supported set of drill paths, we create a sandbox that allows us to manage the
number of dimensions we preaggregate against. We can then design a preaggregating data
indexing layer that is able to present this data with sub-second latency, allowing the
dashboard to respond immediately to the set of interactions we allow the user to drill down into.
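
The difference in scale can be shown with a small counting exercise. The sketch below is illustrative only: the six dimension names and the three drill paths are hypothetical, and it counts groupings, not rows or disk space. It compares every possible combination of dimensions against the groupings actually reachable through a fixed set of drill paths.

```python
from itertools import combinations

# Hypothetical dimensions exposed by a dashboard.
dimensions = ["region", "state", "city", "device", "campaign", "day_of_week"]

# Hypothetical supported drill paths, each explored from left to right.
drill_paths = [
    ["region", "state", "city"],
    ["campaign", "device"],
    ["region", "day_of_week"],
]


def all_groupings(dims):
    """Every non-empty combination of dimensions a user could group or filter by."""
    return [
        combo
        for size in range(1, len(dims) + 1)
        for combo in combinations(dims, size)
    ]


def drill_path_groupings(paths):
    """Only the prefixes of each supported drill path need a preaggregate."""
    groupings = set()
    for path in paths:
        for depth in range(1, len(path) + 1):
            groupings.add(tuple(path[:depth]))
    return groupings


unrestricted = all_groupings(dimensions)       # 2**6 - 1 combinations
supported = drill_path_groupings(drill_paths)  # prefixes only, shared ones counted once
```

With six dimensions there are 63 possible combinations, while these three drill paths need only 6 preaggregated groupings. Each added dimension doubles the unrestricted count, whereas each added drill path adds at most its own length.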
Guiding the user with contextual design cues
Drill paths provide the additional benefit of making the dashboard intuitive to use. With
a data-rich dashboard, a large number of semi-structured events are explored in different
ways, which generally requires a significant number of visual components that interact with
each other. This can get overwhelming fast. We can leverage the same drill paths we define
to provide visual cues of which components are conceptually linked to each other, and provide
a natural way to drill down to explore the data in final detail.

Every dashboard consists of a series of charts, tables, indicators, and controls that
illustrate various properties of the entity being looked at. As a user interacts with one of
these widgets, the other widgets dynamically adapt to provide further insight into the data
(this is what makes it a dashboard and not simply a report.) The user probably isn’t thinking
explicitly in terms of drill paths, so the user experience should focus on minimizing the
mental work required to understand which widgets correspond to which.
To illustrate: imagine a dashboard that contains a pie chart of conversions by region and
a bar chart of conversions by state. These two widgets would be color coded in the same
way, to give the user a visual indication that they are related windows into the same drill path.
Clicking on a region in the pie chart would cause the bar chart to break down by states in
that region. Drilling down further in the pie chart might refresh it to show a breakdown by
state, and simultaneously update the bar chart to show cities within the selected state. Note
that while it would also be possible to click on a region to determine the regional sales by
day of week, this would break the natural visual correspondence between these two widgets. We
would instead use a separate set of components with a different drill path and color scheme,
adding more components but simplifying the interaction.
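
One way to model the linked pie and bar charts from this illustration is as two views onto a single drill-path state. The sketch below is an assumption about how such state could be organized, not a description of any specific implementation; the `DrillState` class and the shape of the returned dictionaries are hypothetical.

```python
from dataclasses import dataclass, field

DRILL_PATH = ("region", "state", "city")  # levels from the illustration


@dataclass
class DrillState:
    """Tracks the values selected so far along one drill path."""

    selection: list = field(default_factory=list)  # e.g. ["West", "California"]

    def drill_down(self, value: str) -> None:
        if len(self.selection) >= len(DRILL_PATH) - 1:
            raise ValueError("already at the deepest level of the drill path")
        self.selection.append(value)

    def drill_up(self) -> None:
        if self.selection:
            self.selection.pop()

    def _view(self, level_index: int) -> dict:
        # Filter by every selected level above the one being displayed.
        filters = dict(zip(DRILL_PATH[:level_index], self.selection[:level_index]))
        return {"group_by": DRILL_PATH[level_index], "filters": filters}

    def pie_chart(self) -> dict:
        """The pie chart shows the level of the most recent selection."""
        return self._view(max(0, len(self.selection) - 1))

    def bar_chart(self) -> dict:
        """The bar chart always shows one level below the pie chart's selection."""
        return self._view(max(1, len(self.selection)))
```

Starting from an empty selection, the pie chart groups by region and the bar chart by state. After `drill_down("West")` the bar chart is filtered to states in that region, and after `drill_down("California")` the pie chart moves to states while the bar chart shows cities. Keeping both widgets derived from one state object is what keeps them in step.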
Optimizing for time and space
If users are able to see the lowest-level data available, they will
feel more comfortable extrapolating from it, even if they’re not
using that data directly on a day-to-day level. The nature of
this data allows for some standard optimizations we can do to
streamline the user experience.
Using maps effectively
Maps are, of course, the natural way to represent geospatial data and get intuitive insight
from it. If you’re showing someone a map, you should expect them to spend significant time
drilling up, down, and scrolling around as they look for patterns. Since there is a potentially
huge amount of data underlying the maps (at minimum, the considerable data needed to
render the map itself), this is a common place for pitfalls in implementing a dashboard.
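
One common way to keep scrolling smooth is to load the regions around the visible area before the user reaches them. The sketch below assumes the map is served as square tiles in the widely used x/y/zoom grid, where each zoom level has 2^zoom tiles per axis; that tiling scheme and the `tiles_to_prefetch` helper are assumptions for illustration, since the text above does not prescribe a map technology.

```python
def tiles_to_prefetch(x: int, y: int, zoom: int, radius: int = 1) -> list:
    """Return the tiles surrounding the visible tile (x, y) at the given zoom level."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    neighbors = set()
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny = y + dy
            if 0 <= ny < n:  # there are no tiles beyond the top or bottom edge
                # Wrap horizontally so scrolling past the map edge keeps working.
                neighbors.add(((x + dx) % n, ny, zoom))
    neighbors.discard((x, y, zoom))  # the visible tile is already loaded
    return sorted(neighbors)
```

For example, `tiles_to_prefetch(0, 0, 2)` returns the five tiles adjacent to the top-left tile, including the ones that wrap around from the right edge. The same idea extends to drilling up and down by requesting the parent or child tiles at neighboring zoom levels.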
Use columnar data storage
Since every event takes place at a specific date and time, almost every query across events
will be hitting a time period in one way or another. That means that minimizing the work
involved in scanning a range of dates is of paramount importance. We have found columnar
databases such as Apache Druid or ClickHouse to be good choices. If we are able to cluster
time entry records physically close to each other when we write them, it makes scanning
across them to resolve queries along various dimensions vastly more efficient than we might
otherwise achieve using a standard relational index.

Plan for multi-tenancy
Even though a dashboard may be envisioned as an internally facing tool only, dashboards tend
to evolve in unexpected ways, and sometimes need to be published more broadly. A lot of the
recommendations we made around drill paths come from the experience of seeing different
groups of users interacting with the data in different ways. Planning for the supported and
unsupported dimensions up front adds great flexibility downstream. Even if the dashboard is
never published, designing as if it might be will likely result in a better thought out
visualization.

Lineate likes to use GraphQL as the protocol for transferring data to a user interface. We
find it works especially well for dashboards. As dashboards get richer, it becomes less likely
that everything desired is represented by a formal schema beneath them. When we define drill
paths, we’re in effect defining various projections of this semi-structured data. GraphQL
provides a nice way of modeling the drill paths accessible to the dashboard. On top of this,
GraphQL lets a client request only the specific data needed to render a view,
which is ideal for rapid development and limiting bandwidth.
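
As a hedged sketch of what a drill-path request might look like, the example below builds the JSON body that a client would typically send to a GraphQL endpoint: a query string plus a variables object. The schema is entirely hypothetical (the `conversions` field, the `Dimension` enum, and the `FilterInput` type are invented for illustration), and no network call is made.

```python
import json

# Hypothetical schema: a `conversions` field grouped by one drill-path level.
DRILL_QUERY = """
query Drill($groupBy: Dimension!, $filters: [FilterInput!]) {
  conversions(groupBy: $groupBy, filters: $filters) {
    key
    total
  }
}
"""


def drill_request(group_by: str, filters: dict) -> str:
    """Build the JSON request body for one widget at one level of a drill path."""
    variables = {
        "groupBy": group_by.upper(),
        "filters": [
            {"dimension": name.upper(), "value": value}
            for name, value in filters.items()
        ],
    }
    return json.dumps({"query": DRILL_QUERY, "variables": variables})


# Usage: the bar chart asks for states within one selected region.
body = drill_request("state", {"region": "West"})
```

The widget names only the two fields it renders (`key` and `total`), so the response carries nothing else. Passing the selection as variables, instead of formatting it into the query text, keeps the query itself fixed for each drill path.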
Conclusion
The explosion of data over the past handful of years has triggered the need for advanced
dashboards that go beyond providing a business intelligence tool to departmental stakeholders.
The large amount of semi-structured data associated with events makes building data-rich
dashboards a quantitatively different exercise than building traditional dashboards.
Done correctly, it’s an interactive window in which people explore and thrive.
The key to making it all work is being fully immersive and interactive. Each component
needs to update with very low latency, and the relationship between data needs to be
intuitive and transparent.
• Have all the data ready to be served, as soon as the user needs it
• Plan for the specific drill paths in which the data will be explored
• Make it clear and intuitive how each widget impacts every other
THANK YOU
lineate.com/contact-us