3 ARCHITECTURE
Pinot is a scalable distributed OLAP data store developed at LinkedIn to deliver real time analytics with low latency. Pinot is optimized for analytical use cases on immutable append-only data and offers data freshness that is on the order of a few seconds.

Figure 1: Pinot Segment
3.2 Components
Pinot has four main components for data storage, data management, and query processing: controllers, brokers, servers, and minions. Additionally, Pinot depends on two external services: Zookeeper and a persistent object store. Pinot uses Apache Helix [13] for cluster management. Apache Helix is a generic cluster management framework that manages partitions and replicas in a distributed system.

Servers are the main component responsible for hosting segments and processing queries on those segments. A segment is stored as a directory in the UNIX filesystem consisting of a segment metadata file and an index file. The segment metadata provides information about the set of columns in the segment: their type, cardinality, encoding, various statistics, and the indexes available for each column. An index file stores the indexes for all the columns. This file is append-only, which allows the server to create inverted indexes on demand. Servers have a pluggable architecture that supports loading columnar indexes from different storage formats as well as generating synthetic columns at runtime. This can be easily extended to read data from distributed filesystems like HDFS or S3. We maintain multiple replicas of a segment within a datacenter for higher availability and query throughput. All replicas participate in query processing.
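To make the segment layout concrete, the following sketch models the kind of per-column information the segment metadata carries; the class and field names are illustrative only and do not reflect Pinot's actual on-disk format or code.

import java.util.List;
import java.util.Map;

// Illustrative sketch (not Pinot's actual format or API) of the information a
// segment directory carries: a metadata file describing every column, plus a
// single append-only index file holding the indexes themselves.
final class SegmentMetadataSketch {

    enum Encoding { DICTIONARY, RAW }

    // Per-column entries: type, cardinality, encoding, statistics, and the
    // indexes that exist for the column.
    record ColumnMetadata(String name,
                          String dataType,           // e.g. "INT", "STRING"
                          int cardinality,
                          Encoding encoding,
                          Map<String, String> stats, // e.g. min, max, totalDocs
                          List<String> indexes) {    // e.g. "forward", "inverted"
    }

    private final String segmentName;
    private final Map<String, ColumnMetadata> columns;
    // The index file is append-only, so an inverted index built on demand can
    // be appended without rewriting the existing column data.
    private final String indexFileName;

    SegmentMetadataSketch(String segmentName,
                          Map<String, ColumnMetadata> columns,
                          String indexFileName) {
        this.segmentName = segmentName;
        this.columns = columns;
        this.indexFileName = indexFileName;
    }

    ColumnMetadata column(String name) {
        return columns.get(name);
    }
}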
Controllers are responsible for maintaining an authoritative mapping of segments to servers using a configurable strategy. Controllers own this mapping and trigger changes to it on operator requests or in response to changes in server availability. Additionally, controllers support various administrative tasks such as listing, adding, or deleting tables and segments. Tables can be configured to have a retention interval, after which segments past the retention period are garbage collected by the controller. All the metadata and the mapping of segments to servers is managed using Apache Helix. For fault tolerance, we run three controller instances in each datacenter with a single master; non-leader controllers are mostly idle. Controller mastership is managed by Apache Helix.

Figure 2: Pinot Cluster Management

Brokers route incoming queries to the appropriate server instances, collect the partial query responses, and merge them into a final result, which is then sent back to the client. Pinot clients send their queries to brokers over HTTP, allowing load balancers to be placed in front of the pool of brokers.

Minions are responsible for running compute-intensive maintenance tasks. Minions execute tasks assigned to them by the controllers' job scheduling system. Task management and scheduling are extensible, so new job and schedule types can be added to satisfy evolving business requirements.

An example of a task that is run on the minions is data purging. LinkedIn must sometimes purge member-specific data in order to comply with various legal requirements. As data in Pinot is immutable, a routine job is scheduled to download segments, expunge the unwanted records, and rewrite and reindex the segments before finally uploading them back into Pinot, replacing the previous segments.

Zookeeper is used as a persistent metadata store and as the communication mechanism between nodes in the cluster. All information about the cluster state, segment assignment, and metadata is stored in Zookeeper through Helix. Segment data itself is stored in the persistent object store. At LinkedIn, Pinot uses a local NFS mountpoint for data storage, but we have also used Azure Disk storage when running outside of LinkedIn's datacenters.

3.3 Common Operations
We now explain how common operations are implemented in Pinot.

3.3.1 Segment Load. Helix uses state machines to model the cluster state; each resource in the cluster has its own current state and desired cluster state. When either state changes, the appropriate state transitions are sent to the respective nodes to be executed. Pinot uses a simple state machine for segment management, as shown in Figure 3. Initially, segments start in the OFFLINE state and Helix requests server nodes to process the OFFLINE to ONLINE transition. In order to handle the state transition, servers fetch the relevant segment from the object store, unpack it, load it, and make it available for query execution. Upon the completion of the state transition, the segment is marked as being in the ONLINE state in Helix.

For realtime data that is to be consumed from Kafka, a state transition happens from the OFFLINE to the CONSUMING state. Upon processing this state transition, a Kafka consumer is created with a given start offset; all replicas for this particular segment start consuming from Kafka at the same location. A consensus protocol described in section 3.3.6 ensures that all replicas converge towards an exact copy of the segment.

Figure 3: Pinot Segment State Machine
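The OFFLINE to ONLINE transition can be pictured as a callback invoked on the server, as in the minimal sketch below; the ObjectStore and SegmentRegistry interfaces are hypothetical stand-ins, not Helix's or Pinot's actual APIs.

import java.nio.file.Path;

// Simplified sketch of the OFFLINE -> ONLINE transition handler a server runs
// when the cluster manager assigns it a segment.
final class SegmentLoadHandler {

    interface ObjectStore {
        Path download(String segmentName);        // fetch the packed segment
    }

    interface SegmentRegistry {
        Path unpack(Path packedSegment);          // unpack into the data directory
        void load(String segmentName, Path dir);  // load indexes, register for queries
    }

    private final ObjectStore objectStore;
    private final SegmentRegistry registry;

    SegmentLoadHandler(ObjectStore objectStore, SegmentRegistry registry) {
        this.objectStore = objectStore;
        this.registry = registry;
    }

    // Invoked for the OFFLINE -> ONLINE transition of a segment.
    void onOfflineToOnline(String segmentName) {
        Path packed = objectStore.download(segmentName); // 1. fetch from the object store
        Path unpacked = registry.unpack(packed);          // 2. unpack locally
        registry.load(segmentName, unpacked);             // 3. make it queryable
        // Returning without an exception lets the cluster manager mark the
        // segment ONLINE; a failure would leave it in an error state instead.
    }
}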
Figure 4: Pinot Segment Load

Figure 5: Query Planning Phases
3.3.2 Routing Table Update. When segments are loaded and unloaded, Helix updates the current cluster state. Brokers listen to changes to the cluster state and update their routing tables, a mapping between servers and available segments. This ensures that brokers are routing queries to replicas that are available as new replicas come online or are marked as unavailable. The process of routing table creation is described in more detail in section 4.4.
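As a rough illustration, a broker-side routing table might be maintained as in the sketch below; the types and method names are invented for the example and are not Pinot's implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical broker-side routing state: segment-to-server availability that
// is refreshed on every cluster-state change and inverted into a routing table.
final class RoutingTableSketch {

    // segment name -> servers currently hosting an available replica
    private volatile Map<String, Set<String>> segmentToServers = Map.of();

    // Called when the broker observes a cluster-state change.
    void onClusterStateChange(Map<String, Set<String>> newSegmentToServers) {
        this.segmentToServers = Map.copyOf(newSegmentToServers);
    }

    // Builds "server -> segments to query there" so that every segment of the
    // table is covered exactly once by some available replica.
    Map<String, List<String>> buildRoutingTable(Set<String> segmentsOfTable) {
        Map<String, List<String>> routing = new HashMap<>();
        for (String segment : segmentsOfTable) {
            Set<String> candidates = segmentToServers.getOrDefault(segment, Set.of());
            if (candidates.isEmpty()) {
                continue; // no available replica; the query result will be partial
            }
            String server = candidates.iterator().next(); // real strategies balance load
            routing.computeIfAbsent(server, s -> new ArrayList<>()).add(segment);
        }
        return routing;
    }
}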
3.3.3 Query Processing. When a query arrives on a broker, several steps happen (a sketch of this flow follows the list):
(1) The query is parsed and optimized
(2) A routing table for that particular table is picked at random
(3) All servers in the routing table are contacted and asked to process the query on a subset of segments in the table
(4) Servers generate logical and physical query plans based on index availability and column metadata
(5) The query plans are scheduled for execution
(6) Upon completion of all query plan executions, the results are gathered, merged, and returned to the broker
(7) When all results are gathered from the servers, the partial per-server results are merged together. Errors or timeouts during processing cause the query result to be marked as partial, so that the client can choose to either display incomplete query results to the user or resubmit the query at a later time.
(8) The query result is returned to the client
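The sketch below condenses this scatter-gather flow into code, assuming a hypothetical ServerClient interface; query optimization and result merging are simplified far beyond the real broker.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical, condensed version of the broker's scatter-gather flow.
final class BrokerQuerySketch {

    record ServerResult(List<Object[]> rows, boolean error) {}
    record QueryResult(List<Object[]> rows, boolean partial) {}

    interface ServerClient {
        // Ask one server to execute the query on its subset of segments.
        CompletableFuture<ServerResult> execute(String server, String query, List<String> segments);
    }

    private final ServerClient client;

    BrokerQuerySketch(ServerClient client) {
        this.client = client;
    }

    QueryResult execute(String query, Map<String, List<String>> routingTable, long timeoutMs) {
        // Steps (1)-(3): the query has been parsed and a routing table picked;
        // contact every server in it with its own subset of segments.
        List<CompletableFuture<ServerResult>> futures = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : routingTable.entrySet()) {
            futures.add(client.execute(entry.getKey(), query, entry.getValue()));
        }

        // Steps (6)-(8): gather partial per-server results and merge them; errors
        // or timeouts mark the overall result as partial instead of failing it.
        List<Object[]> merged = new ArrayList<>();
        boolean partial = false;
        for (CompletableFuture<ServerResult> future : futures) {
            try {
                ServerResult result = future.get(timeoutMs, TimeUnit.MILLISECONDS);
                if (result.error()) {
                    partial = true;
                } else {
                    merged.addAll(result.rows());
                }
            } catch (Exception timeoutOrFailure) {
                partial = true;
            }
        }
        return new QueryResult(merged, partial);
    }
}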
Pinot supports dynamically merging data streams that come from offline and realtime systems. To do so, these hybrid tables contain data that overlaps temporally. Figure 6 shows that a hypothetical table with two segments per day might have overlapping data for August 1st and 2nd. When a query against such a table arrives in Pinot, it is transparently rewritten into two queries: one query for the offline part, which queries data prior to the time boundary, and a second one for the realtime part, which queries data at or after the time boundary.

When both queries complete, the results are merged, allowing us to cheaply provide merging of offline and realtime data. In order for this scheme to work, hybrid tables require having a time column that is shared between the offline and realtime tables. In practice, we have not found this to be an onerous requirement, as most data written to streaming systems tends to have a temporal component.

Figure 6: Hybrid Query Rewriting
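A minimal sketch of the time-boundary rewrite follows; the SQL-like query, table name, time column, and boundary value are invented for illustration.

import java.util.List;

// Illustrative time-boundary rewrite for a hybrid table.
final class HybridQueryRewriterSketch {

    record RewrittenQueries(String offlineQuery, String realtimeQuery) {}

    // Appends a time predicate to each side: the offline query covers data
    // strictly before the boundary, the realtime query data at or after it,
    // so the two result sets can be merged without double counting.
    RewrittenQueries rewrite(String query, String timeColumn, long timeBoundary) {
        String offline = query + " AND " + timeColumn + " < " + timeBoundary;
        String realtime = query + " AND " + timeColumn + " >= " + timeBoundary;
        return new RewrittenQueries(offline, realtime);
    }

    public static void main(String[] args) {
        HybridQueryRewriterSketch rewriter = new HybridQueryRewriterSketch();
        // Hypothetical query against a hybrid table with a shared time column.
        RewrittenQueries split = rewriter.rewrite(
                "SELECT COUNT(*) FROM pageViews WHERE country = 'US'",
                "daysSinceEpoch", 17745L /* e.g. an August 2nd boundary */);
        System.out.println(split.offlineQuery());
        System.out.println(split.realtimeQuery());
    }
}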
3.3.4 Server-Side Query Execution. On the server side, when a query is received, logical and physical query plans are generated. As available indexes and physical record layouts can be different between segments, query plans are generated on a per-segment basis. This allows Pinot to do certain optimizations for special cases, such as a predicate matching all values of a segment. Special query plans are also generated for queries that can be answered using segment metadata, such as obtaining the maximum value of a column without any predicates.

Physical operator selection is done based on an estimated execution cost, and operators can be reordered in order to lower the overall cost of processing the query based on per-column statistics. The resulting query plans are then submitted for execution to the query execution scheduler. Query plans are processed in parallel.
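The per-segment planning decision can be sketched as below; the plan names and metadata accessors are invented for the example and only illustrate the kind of choices described above.

// Illustrative per-segment physical plan selection.
final class SegmentPlannerSketch {

    enum PlanKind { METADATA_ONLY, INVERTED_INDEX_LOOKUP, FULL_SCAN, EMPTY_RESULT }

    // Minimal view of what the planner needs from one segment.
    interface SegmentView {
        boolean hasInvertedIndex(String column);
        long minValue(String column);
        long maxValue(String column);
    }

    // Hypothetical query shape: max(aggColumn) with one optional equality predicate.
    record MaxQuery(String aggColumn, String filterColumn, long filterValue) {}

    PlanKind plan(MaxQuery query, SegmentView segment) {
        if (query.filterColumn() == null) {
            // max() without predicates is answered straight from segment metadata.
            return PlanKind.METADATA_ONLY;
        }
        // Prune the segment if the predicate can never match its value range.
        if (query.filterValue() < segment.minValue(query.filterColumn())
                || query.filterValue() > segment.maxValue(query.filterColumn())) {
            return PlanKind.EMPTY_RESULT;
        }
        // Otherwise the plan depends on which indexes this segment actually has.
        return segment.hasInvertedIndex(query.filterColumn())
                ? PlanKind.INVERTED_INDEX_LOOKUP
                : PlanKind.FULL_SCAN;
    }
}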
3.3.5 Data Upload. To upload data, segments are uploaded to the controller using HTTP POST. When a controller receives a segment, it unpacks it to ensure its integrity, verifies that the segment size would not put the table over quota, writes the segment metadata in Zookeeper, then updates the desired cluster state by assigning the segment to be in the ONLINE state on the appropriate number of replicas. Updating the desired cluster state then triggers the segment load as described earlier.
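The controller-side handling of an upload might look roughly like the following sketch; the metadata store, quota check, and cluster-state update sit behind hypothetical interfaces rather than Pinot's real controller code.

import java.nio.file.Path;

// Hypothetical sketch of the controller-side handling of a segment upload.
final class SegmentUploadHandlerSketch {

    interface SegmentPackage {
        Path unpackAndVerify();            // throws if the archive is corrupt
        long sizeBytes();
        String tableName();
        String segmentName();
    }

    interface MetadataStore {              // backed by Zookeeper in the text
        void writeSegmentMetadata(String table, String segment, Path unpackedDir);
    }

    interface ClusterStateManager {        // backed by Helix in the text
        void markOnline(String table, String segment, int replicas);
    }

    interface QuotaChecker {
        boolean fitsQuota(String table, long additionalBytes);
    }

    private final MetadataStore metadataStore;
    private final ClusterStateManager clusterState;
    private final QuotaChecker quota;
    private final int replicasPerSegment;

    SegmentUploadHandlerSketch(MetadataStore metadataStore, ClusterStateManager clusterState,
                               QuotaChecker quota, int replicasPerSegment) {
        this.metadataStore = metadataStore;
        this.clusterState = clusterState;
        this.quota = quota;
        this.replicasPerSegment = replicasPerSegment;
    }

    // Invoked for an HTTP POST of a packed segment.
    void handleUpload(SegmentPackage upload) {
        Path unpacked = upload.unpackAndVerify();                       // integrity check
        if (!quota.fitsQuota(upload.tableName(), upload.sizeBytes())) { // quota check
            throw new IllegalStateException("table over quota: " + upload.tableName());
        }
        metadataStore.writeSegmentMetadata(upload.tableName(), upload.segmentName(), unpacked);
        // Updating the desired cluster state triggers the segment load on servers.
        clusterState.markOnline(upload.tableName(), upload.segmentName(), replicasPerSegment);
    }
}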
3.3.6 Segment Completion. Ideally, all replicas of a consuming segment would end up with the same exact data; however, two consumers consuming for a certain amount of time based on their local clock will likely diverge. As such, Pinot has a segment completion protocol that ensures that independent replicas have a consensus on what the contents of the final segment should be.

When a segment completes consuming, the server starts polling the leader controller for instructions and gives its current Kafka offset. The controller then returns a single instruction to the server. Possible instructions are (a sketch of the server-side handling follows the list):
HOLD Instructs the server to do nothing and poll at a later time
DISCARD Instructs the server to discard its local data and fetch an authoritative copy from the controller; this happens if another replica has already successfully committed a different version of the segment
CATCHUP Instructs the server to consume up to a given Kafka offset, then start polling again
KEEP Instructs the server to flush the current segment to disk and load it; this happens if the offset the server is at is exactly the same as the one in the committed copy
COMMIT Instructs the server to flush the current segment to disk and attempt to commit it; if the commit fails, resume polling, otherwise, load the segment
NOTLEADER Instructs the server to look up the current cluster leader, as this controller is not currently the cluster leader, then start polling again
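The sketch below shows how a server might act on these instructions while polling; the controller client and local segment operations are hypothetical stand-ins for Pinot's actual implementation.

// Hypothetical server-side loop for the segment completion protocol.
final class SegmentCompletionClientSketch {

    enum Instruction { HOLD, DISCARD, CATCHUP, KEEP, COMMIT, NOTLEADER }

    record Response(Instruction instruction, long targetOffset) {}

    interface ControllerClient {
        Response poll(String leader, String segment, long currentOffset);
        String lookUpLeader();
        boolean commit(String leader, String segment, long finalOffset);
    }

    interface LocalSegment {
        long currentOffset();
        void consumeUpTo(long offset);          // keep consuming from Kafka
        void flushToDisk();
        void load();
        void discardAndDownloadCommitted();     // replace local data with committed copy
    }

    void completeSegment(String segment, ControllerClient controller, LocalSegment local)
            throws InterruptedException {
        String leader = controller.lookUpLeader();
        while (true) {
            Response response = controller.poll(leader, segment, local.currentOffset());
            switch (response.instruction()) {
                case HOLD -> Thread.sleep(1000);                       // poll again later
                case CATCHUP -> local.consumeUpTo(response.targetOffset());
                case KEEP -> { local.flushToDisk(); local.load(); return; } // offsets match
                case DISCARD -> { local.discardAndDownloadCommitted(); return; }
                case COMMIT -> {
                    local.flushToDisk();
                    if (controller.commit(leader, segment, local.currentOffset())) {
                        local.load();
                        return;                                        // this replica committed
                    }                                                  // else resume polling
                }
                case NOTLEADER -> leader = controller.lookUpLeader();  // retry with new leader
            }
        }
    }
}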
Figure 7: Query Planning Phases

Replies by the controller are managed by a state machine that waits until enough replicas have contacted the controller or enough time has passed since the first poll to determine a replica to be the committer. The state machine attempts to get all replicas to catch up to the largest offset of all replicas and picks one of the replicas with the largest offset to be the committer. On controller failure, a new blank state machine is started on the new leader controller; this only delays the segment commit, but otherwise has no effect on correctness.

This approach minimizes network transfers while ensuring all replicas have identical data when segments are flushed.
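For completeness, the controller side can be pictured as in the following simplified sketch: wait until enough replicas have polled or enough time has passed, ask stragglers to catch up to the largest reported offset, and have one replica at that offset commit. The types are invented, and the real state machine handles more outcomes (such as KEEP and DISCARD for the non-committing replicas).

import java.util.HashMap;
import java.util.Map;

// Simplified controller-side view of the segment completion state machine.
final class SegmentCommitCoordinatorSketch {

    enum Instruction { HOLD, CATCHUP, COMMIT }

    private final int expectedReplicas;
    private final long maxWaitMillis;
    private final long firstPollMillis = System.currentTimeMillis();
    private final Map<String, Long> reportedOffsets = new HashMap<>();
    private String chosenCommitter;

    SegmentCommitCoordinatorSketch(int expectedReplicas, long maxWaitMillis) {
        this.expectedReplicas = expectedReplicas;
        this.maxWaitMillis = maxWaitMillis;
    }

    synchronized Instruction onPoll(String replica, long offset) {
        reportedOffsets.put(replica, offset);

        boolean enoughReplicas = reportedOffsets.size() >= expectedReplicas;
        boolean enoughTime = System.currentTimeMillis() - firstPollMillis >= maxWaitMillis;
        if (!enoughReplicas && !enoughTime) {
            return Instruction.HOLD;               // keep waiting for more replicas
        }

        long largestOffset = reportedOffsets.values().stream().max(Long::compare).orElse(offset);
        if (offset < largestOffset) {
            return Instruction.CATCHUP;            // consume up to largestOffset, then poll again
        }
        if (chosenCommitter == null) {
            chosenCommitter = replica;             // a replica at the largest offset commits
        }
        return chosenCommitter.equals(replica) ? Instruction.COMMIT : Instruction.HOLD;
    }
}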
this means that Pinot needs to offer both high throughput for automated queries and low latency for interactive queries coming from users. Queries contain aggregations of metrics with a variable number of filtering predicates and grouping clauses, depending on the specific drill down requested by the user.

The data size for this scenario is 16 GB of data for Pinot and 13.8 GB of data for Druid. Figure 11 shows the latency of different indexing strategies as the query rate increases. On this dataset, the latency of Druid quickly becomes too high for interactive purposes. As expected, the performance of Pinot without indexes also drops off quickly. Adding inverted indices increases the scalability of Pinot by a factor of two for this dataset, but the largest gain in scalability comes from integrating the star tree index structure described in section 4.3.
Figure 14: Comparison of Druid and Pinot on the “share analytics” dataset (latency in milliseconds versus queries per second, at the 50th, 90th, 95th, and 99th latency percentiles)
from scaling such a system. We have also shown that a single system can process a wide range of commonly encountered analytical queries from a large web site. Finally, we have also compared how Druid, a system similar to Pinot, performs with production data and queries from a large professional network, as well as the impact of various indexing techniques and optimizations implemented in Pinot.

Future work includes adding additional types of indexes and specialized data structures for query optimization and observing their effects on query performance and service scalability.