3 ARCHITECTURE
Pinot is a scalable distributed OLAP data store developed at LinkedIn to deliver real time analytics with low latency. Pinot is optimized for analytical use cases on immutable append-only data and offers data freshness that is on the order of a few seconds.

Figure 1: Pinot Segment
3.2 Components
Pinot has four main components for data storage, data management, and query processing: controllers, brokers, servers, and minions. Additionally, Pinot depends on two external services: Zookeeper and a persistent object store. Pinot uses Apache Helix [13] for cluster management. Apache Helix is a generic cluster management framework that manages partitions and replicas in a distributed system.

Servers are the main component responsible for hosting segments and processing queries on those segments. A segment is stored as a directory in the UNIX filesystem consisting of a segment metadata file and an index file. The segment metadata provides information about the set of columns in the segment: their type, cardinality, encoding, various statistics, and the indexes available for each column. An index file stores the indexes for all the columns. This file is append-only, which allows the server to create inverted indexes on demand. Servers have a pluggable architecture that supports loading columnar indexes from different storage formats as well as generating synthetic columns at runtime. This can be easily extended to read data from distributed filesystems like HDFS or S3. We maintain multiple replicas of a segment within a datacenter for higher availability and query throughput. All replicas participate in query processing.
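To make the segment layout concrete, the following sketch models the kind of per-column information the segment metadata carries; the class and field names are illustrative only and do not reflect Pinot's actual on-disk format or code.

import java.util.List;
import java.util.Map;

// Illustrative sketch (not Pinot's actual format or API) of the information a
// segment directory carries: a metadata file describing every column, plus a
// single append-only index file holding the indexes themselves.
final class SegmentMetadataSketch {

    enum Encoding { DICTIONARY, RAW }

    // Per-column entries: type, cardinality, encoding, statistics, and the
    // indexes that exist for the column.
    record ColumnMetadata(String name,
                          String dataType,           // e.g. "INT", "STRING"
                          int cardinality,
                          Encoding encoding,
                          Map<String, String> stats, // e.g. min, max, totalDocs
                          List<String> indexes) {    // e.g. "forward", "inverted"
    }

    private final String segmentName;
    private final Map<String, ColumnMetadata> columns;
    // The index file is append-only, so an inverted index built on demand can
    // be appended without rewriting the existing column data.
    private final String indexFileName;

    SegmentMetadataSketch(String segmentName,
                          Map<String, ColumnMetadata> columns,
                          String indexFileName) {
        this.segmentName = segmentName;
        this.columns = columns;
        this.indexFileName = indexFileName;
    }

    ColumnMetadata column(String name) {
        return columns.get(name);
    }
}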
Controllers are responsible for maintaining an authoritative mapping of segments to servers using a configurable strategy. Controllers own this mapping and trigger changes to it on operator requests or in response to changes in server availability. Additionally, controllers support various administrative tasks such as listing, adding, or deleting tables and segments. Tables can be configured to have a retention interval, after which segments past the retention period are garbage collected by the controller. All the metadata and the mapping of segments to servers is managed using Apache Helix. For fault tolerance, we run three controller instances in each datacenter with a single master; non-leader controllers are mostly idle. Controller mastership is managed by Apache Helix.

Figure 2: Pinot Cluster Management

Brokers route incoming queries to the appropriate server instances, collect the partial query responses, and merge them into a final result, which is then sent back to the client. Pinot clients send their queries to brokers over HTTP, allowing load balancers to be placed in front of the pool of brokers.

Minions are responsible for running compute-intensive maintenance tasks. Minions execute tasks assigned to them by the controllers' job scheduling system. Task management and scheduling are extensible, so new job and schedule types can be added to satisfy evolving business requirements.

An example of a task that is run on the minions is data purging. LinkedIn must sometimes purge member-specific data in order to comply with various legal requirements. As data in Pinot is immutable, a routine job is scheduled to download segments, expunge the unwanted records, and rewrite and reindex the segments before finally uploading them back into Pinot, replacing the previous segments.

Zookeeper is used as a persistent metadata store and as the communication mechanism between nodes in the cluster. All information about the cluster state, segment assignment, and metadata is stored in Zookeeper through Helix. Segment data itself is stored in the persistent object store. At LinkedIn, Pinot uses a local NFS mountpoint for data storage, but we have also used Azure Disk storage when running outside of LinkedIn's datacenters.

3.3 Common Operations
We now explain how common operations are implemented in Pinot.

3.3.1 Segment Load. Helix uses state machines to model the cluster state; each resource in the cluster has its own current state and desired cluster state. When either state changes, the appropriate state transitions are sent to the respective nodes to be executed. Pinot uses a simple state machine for segment management, as shown in Figure 3. Initially, segments start in the OFFLINE state and Helix requests server nodes to process the OFFLINE to ONLINE transition. In order to handle the state transition, servers fetch the relevant segment from the object store, unpack it, load it, and make it available for query execution. Upon the completion of the state transition, the segment is marked as being in the ONLINE state in Helix.

For realtime data that is to be consumed from Kafka, a state transition happens from the OFFLINE to the CONSUMING state. Upon processing this state transition, a Kafka consumer is created with a given start offset; all replicas for this particular segment start consuming from Kafka at the same location. A consensus protocol described in section 3.3.6 ensures that all replicas converge towards an exact copy of the segment.

Figure 3: Pinot Segment State Machine
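The OFFLINE to ONLINE transition can be pictured as a callback invoked on the server, as in the minimal sketch below; the ObjectStore and SegmentRegistry interfaces are hypothetical stand-ins, not Helix's or Pinot's actual APIs.

import java.nio.file.Path;

// Simplified sketch of the OFFLINE -> ONLINE transition handler a server runs
// when the cluster manager assigns it a segment.
final class SegmentLoadHandler {

    interface ObjectStore {
        Path download(String segmentName);        // fetch the packed segment
    }

    interface SegmentRegistry {
        Path unpack(Path packedSegment);          // unpack into the data directory
        void load(String segmentName, Path dir);  // load indexes, register for queries
    }

    private final ObjectStore objectStore;
    private final SegmentRegistry registry;

    SegmentLoadHandler(ObjectStore objectStore, SegmentRegistry registry) {
        this.objectStore = objectStore;
        this.registry = registry;
    }

    // Invoked for the OFFLINE -> ONLINE transition of a segment.
    void onOfflineToOnline(String segmentName) {
        Path packed = objectStore.download(segmentName); // 1. fetch from the object store
        Path unpacked = registry.unpack(packed);          // 2. unpack locally
        registry.load(segmentName, unpacked);             // 3. make it queryable
        // Returning without an exception lets the cluster manager mark the
        // segment ONLINE; a failure would leave it in an error state instead.
    }
}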
Figure 4: Pinot Segment Load

Figure 5: Query Planning Phases
3.3.2 Routing Table Update. When segments are loaded and unloaded, Helix updates the current cluster state. Brokers listen to changes to the cluster state and update their routing tables, a mapping between servers and available segments. This ensures that brokers are routing queries to replicas that are available as new replicas come online or are marked as unavailable. The process of routing table creation is described in more detail in section 4.4.
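As a rough illustration, a broker-side routing table might be maintained as in the sketch below; the types and method names are invented for the example and are not Pinot's implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical broker-side routing state: segment-to-server availability that
// is refreshed on every cluster-state change and inverted into a routing table.
final class RoutingTableSketch {

    // segment name -> servers currently hosting an available replica
    private volatile Map<String, Set<String>> segmentToServers = Map.of();

    // Called when the broker observes a cluster-state change.
    void onClusterStateChange(Map<String, Set<String>> newSegmentToServers) {
        this.segmentToServers = Map.copyOf(newSegmentToServers);
    }

    // Builds "server -> segments to query there" so that every segment of the
    // table is covered exactly once by some available replica.
    Map<String, List<String>> buildRoutingTable(Set<String> segmentsOfTable) {
        Map<String, List<String>> routing = new HashMap<>();
        for (String segment : segmentsOfTable) {
            Set<String> candidates = segmentToServers.getOrDefault(segment, Set.of());
            if (candidates.isEmpty()) {
                continue; // no available replica; the query result will be partial
            }
            String server = candidates.iterator().next(); // real strategies balance load
            routing.computeIfAbsent(server, s -> new ArrayList<>()).add(segment);
        }
        return routing;
    }
}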
3.3.3 Query Processing. When a query arrives on a broker, several steps happen (a sketch of this flow follows the list):
(1) The query is parsed and optimized
(2) A routing table for that particular table is picked at random
(3) All servers in the routing table are contacted and asked to process the query on a subset of segments in the table
(4) Servers generate logical and physical query plans based on index availability and column metadata
(5) The query plans are scheduled for execution
(6) Upon completion of all query plan executions, the results are gathered, merged, and returned to the broker
(7) When all results are gathered from the servers, the partial per-server results are merged together. Errors or timeouts during processing cause the query result to be marked as partial, so that the client can choose to either display incomplete query results to the user or resubmit the query at a later time.
(8) The query result is returned to the client
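The sketch below condenses this scatter-gather flow into code, assuming a hypothetical ServerClient interface; query optimization and result merging are simplified far beyond the real broker.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical, condensed version of the broker's scatter-gather flow.
final class BrokerQuerySketch {

    record ServerResult(List<Object[]> rows, boolean error) {}
    record QueryResult(List<Object[]> rows, boolean partial) {}

    interface ServerClient {
        // Ask one server to execute the query on its subset of segments.
        CompletableFuture<ServerResult> execute(String server, String query, List<String> segments);
    }

    private final ServerClient client;

    BrokerQuerySketch(ServerClient client) {
        this.client = client;
    }

    QueryResult execute(String query, Map<String, List<String>> routingTable, long timeoutMs) {
        // Steps (1)-(3): the query has been parsed and a routing table picked;
        // contact every server in it with its own subset of segments.
        List<CompletableFuture<ServerResult>> futures = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : routingTable.entrySet()) {
            futures.add(client.execute(entry.getKey(), query, entry.getValue()));
        }

        // Steps (6)-(8): gather partial per-server results and merge them; errors
        // or timeouts mark the overall result as partial instead of failing it.
        List<Object[]> merged = new ArrayList<>();
        boolean partial = false;
        for (CompletableFuture<ServerResult> future : futures) {
            try {
                ServerResult result = future.get(timeoutMs, TimeUnit.MILLISECONDS);
                if (result.error()) {
                    partial = true;
                } else {
                    merged.addAll(result.rows());
                }
            } catch (Exception timeoutOrFailure) {
                partial = true;
            }
        }
        return new QueryResult(merged, partial);
    }
}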
Pinot supports dynamically merging data streams that come from offline and realtime systems. To do so, these hybrid tables contain data that overlaps temporally. Figure 6 shows that a hypothetical table with two segments per day might have overlapping data for August 1st and 2nd. When a query against such a table arrives in Pinot, it is transparently rewritten into two queries: one query for the offline part, which queries data prior to the time boundary, and a second one for the realtime part, which queries data at or after the time boundary.

When both queries complete, the results are merged, allowing us to cheaply provide merging of offline and realtime data. In order for this scheme to work, hybrid tables require having a time column that is shared between the offline and realtime tables. In practice, we have not found this to be an onerous requirement, as most data written to streaming systems tends to have a temporal component.

Figure 6: Hybrid Query Rewriting
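A minimal sketch of the time-boundary rewrite follows; the SQL-like query, table name, time column, and boundary value are invented for illustration.

import java.util.List;

// Illustrative time-boundary rewrite for a hybrid table.
final class HybridQueryRewriterSketch {

    record RewrittenQueries(String offlineQuery, String realtimeQuery) {}

    // Appends a time predicate to each side: the offline query covers data
    // strictly before the boundary, the realtime query data at or after it,
    // so the two result sets can be merged without double counting.
    RewrittenQueries rewrite(String query, String timeColumn, long timeBoundary) {
        String offline = query + " AND " + timeColumn + " < " + timeBoundary;
        String realtime = query + " AND " + timeColumn + " >= " + timeBoundary;
        return new RewrittenQueries(offline, realtime);
    }

    public static void main(String[] args) {
        HybridQueryRewriterSketch rewriter = new HybridQueryRewriterSketch();
        // Hypothetical query against a hybrid table with a shared time column.
        RewrittenQueries split = rewriter.rewrite(
                "SELECT COUNT(*) FROM pageViews WHERE country = 'US'",
                "daysSinceEpoch", 17745L /* e.g. an August 2nd boundary */);
        System.out.println(split.offlineQuery());
        System.out.println(split.realtimeQuery());
    }
}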
3.3.4 Server-Side Query Execution. On the server side, when a query is received, logical and physical query plans are generated. As available indexes and physical record layouts can be different between segments, query plans are generated on a per-segment basis. This allows Pinot to do certain optimizations for special cases, such as a predicate matching all values of a segment. Special query plans are also generated for queries that can be answered using segment metadata, such as obtaining the maximum value of a column without any predicates.

Physical operator selection is done based on an estimated execution cost, and operators can be reordered in order to lower the overall cost of processing the query based on per-column statistics. The resulting query plans are then submitted for execution to the query execution scheduler. Query plans are processed in parallel.
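The per-segment planning decision can be sketched as below; the plan names and metadata accessors are invented for the example and only illustrate the kind of choices described above.

// Illustrative per-segment physical plan selection.
final class SegmentPlannerSketch {

    enum PlanKind { METADATA_ONLY, INVERTED_INDEX_LOOKUP, FULL_SCAN, EMPTY_RESULT }

    // Minimal view of what the planner needs from one segment.
    interface SegmentView {
        boolean hasInvertedIndex(String column);
        long minValue(String column);
        long maxValue(String column);
    }

    // Hypothetical query shape: max(aggColumn) with one optional equality predicate.
    record MaxQuery(String aggColumn, String filterColumn, long filterValue) {}

    PlanKind plan(MaxQuery query, SegmentView segment) {
        if (query.filterColumn() == null) {
            // max() without predicates is answered straight from segment metadata.
            return PlanKind.METADATA_ONLY;
        }
        // Prune the segment if the predicate can never match its value range.
        if (query.filterValue() < segment.minValue(query.filterColumn())
                || query.filterValue() > segment.maxValue(query.filterColumn())) {
            return PlanKind.EMPTY_RESULT;
        }
        // Otherwise the plan depends on which indexes this segment actually has.
        return segment.hasInvertedIndex(query.filterColumn())
                ? PlanKind.INVERTED_INDEX_LOOKUP
                : PlanKind.FULL_SCAN;
    }
}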
3.3.5 Data Upload. To upload data, segments are uploaded to the controller using HTTP POST. When a controller receives a segment, it unpacks it to ensure its integrity, verifies that the segment size would not put the table over quota, writes the segment metadata in Zookeeper, then updates the desired cluster state by assigning the segment to be in the ONLINE state on the appropriate number of replicas. Updating the desired cluster state then triggers the segment load as described earlier.
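The controller-side handling of an upload might look roughly like the following sketch; the metadata store, quota check, and cluster-state update sit behind hypothetical interfaces rather than Pinot's real controller code.

import java.nio.file.Path;

// Hypothetical sketch of the controller-side handling of a segment upload.
final class SegmentUploadHandlerSketch {

    interface SegmentPackage {
        Path unpackAndVerify();            // throws if the archive is corrupt
        long sizeBytes();
        String tableName();
        String segmentName();
    }

    interface MetadataStore {              // backed by Zookeeper in the text
        void writeSegmentMetadata(String table, String segment, Path unpackedDir);
    }

    interface ClusterStateManager {        // backed by Helix in the text
        void markOnline(String table, String segment, int replicas);
    }

    interface QuotaChecker {
        boolean fitsQuota(String table, long additionalBytes);
    }

    private final MetadataStore metadataStore;
    private final ClusterStateManager clusterState;
    private final QuotaChecker quota;
    private final int replicasPerSegment;

    SegmentUploadHandlerSketch(MetadataStore metadataStore, ClusterStateManager clusterState,
                               QuotaChecker quota, int replicasPerSegment) {
        this.metadataStore = metadataStore;
        this.clusterState = clusterState;
        this.quota = quota;
        this.replicasPerSegment = replicasPerSegment;
    }

    // Invoked for an HTTP POST of a packed segment.
    void handleUpload(SegmentPackage upload) {
        Path unpacked = upload.unpackAndVerify();                       // integrity check
        if (!quota.fitsQuota(upload.tableName(), upload.sizeBytes())) { // quota check
            throw new IllegalStateException("table over quota: " + upload.tableName());
        }
        metadataStore.writeSegmentMetadata(upload.tableName(), upload.segmentName(), unpacked);
        // Updating the desired cluster state triggers the segment load on servers.
        clusterState.markOnline(upload.tableName(), upload.segmentName(), replicasPerSegment);
    }
}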
3.3.6 Segment Completion. Ideally, all replicas of a consuming segment would end up with the same exact data; however, two consumers consuming for a certain amount of time based on their local clock will likely diverge. As such, Pinot has a segment completion protocol that ensures that independent replicas have a consensus on what the contents of the final segment should be.

When a segment completes consuming, the server starts polling the leader controller for instructions and gives its current Kafka offset. The controller then returns a single instruction to the server. Possible instructions are (a sketch of the server-side handling follows the list):
HOLD Instructs the server to do nothing and poll at a later time
DISCARD Instructs the server to discard its local data and fetch an authoritative copy from the controller; this happens if another replica has already successfully committed a different version of the segment
CATCHUP Instructs the server to consume up to a given Kafka offset, then start polling again
KEEP Instructs the server to flush the current segment to disk and load it; this happens if the offset the server is at is exactly the same as the one in the committed copy
COMMIT Instructs the server to flush the current segment to disk and attempt to commit it; if the commit fails, resume polling, otherwise, load the segment
NOTLEADER Instructs the server to look up the current cluster leader, as this controller is not currently the cluster leader, then start polling again
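The sketch below shows how a server might act on these instructions while polling; the controller client and local segment operations are hypothetical stand-ins for Pinot's actual implementation.

// Hypothetical server-side loop for the segment completion protocol.
final class SegmentCompletionClientSketch {

    enum Instruction { HOLD, DISCARD, CATCHUP, KEEP, COMMIT, NOTLEADER }

    record Response(Instruction instruction, long targetOffset) {}

    interface ControllerClient {
        Response poll(String leader, String segment, long currentOffset);
        String lookUpLeader();
        boolean commit(String leader, String segment, long finalOffset);
    }

    interface LocalSegment {
        long currentOffset();
        void consumeUpTo(long offset);          // keep consuming from Kafka
        void flushToDisk();
        void load();
        void discardAndDownloadCommitted();     // replace local data with committed copy
    }

    void completeSegment(String segment, ControllerClient controller, LocalSegment local)
            throws InterruptedException {
        String leader = controller.lookUpLeader();
        while (true) {
            Response response = controller.poll(leader, segment, local.currentOffset());
            switch (response.instruction()) {
                case HOLD -> Thread.sleep(1000);                       // poll again later
                case CATCHUP -> local.consumeUpTo(response.targetOffset());
                case KEEP -> { local.flushToDisk(); local.load(); return; } // offsets match
                case DISCARD -> { local.discardAndDownloadCommitted(); return; }
                case COMMIT -> {
                    local.flushToDisk();
                    if (controller.commit(leader, segment, local.currentOffset())) {
                        local.load();
                        return;                                        // this replica committed
                    }                                                  // else resume polling
                }
                case NOTLEADER -> leader = controller.lookUpLeader();  // retry with new leader
            }
        }
    }
}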
Figure 7: Query Planning Phases

Replies by the controller are managed by a state machine that waits until enough replicas have contacted the controller or enough time has passed since the first poll to determine a replica to be the committer. The state machine attempts to get all replicas to catch up to the largest offset of all replicas and picks one of the replicas with the largest offset to be the committer. On controller failure, a new blank state machine is started on the new leader controller; this only delays the segment commit, but otherwise has no effect on correctness.

This approach minimizes network transfers while ensuring all replicas have identical data when segments are flushed.
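For completeness, the controller side can be pictured as in the following simplified sketch: wait until enough replicas have polled or enough time has passed, ask stragglers to catch up to the largest reported offset, and have one replica at that offset commit. The types are invented, and the real state machine handles more outcomes (such as KEEP and DISCARD for the non-committing replicas).

import java.util.HashMap;
import java.util.Map;

// Simplified controller-side view of the segment completion state machine.
final class SegmentCommitCoordinatorSketch {

    enum Instruction { HOLD, CATCHUP, COMMIT }

    private final int expectedReplicas;
    private final long maxWaitMillis;
    private final long firstPollMillis = System.currentTimeMillis();
    private final Map<String, Long> reportedOffsets = new HashMap<>();
    private String chosenCommitter;

    SegmentCommitCoordinatorSketch(int expectedReplicas, long maxWaitMillis) {
        this.expectedReplicas = expectedReplicas;
        this.maxWaitMillis = maxWaitMillis;
    }

    synchronized Instruction onPoll(String replica, long offset) {
        reportedOffsets.put(replica, offset);

        boolean enoughReplicas = reportedOffsets.size() >= expectedReplicas;
        boolean enoughTime = System.currentTimeMillis() - firstPollMillis >= maxWaitMillis;
        if (!enoughReplicas && !enoughTime) {
            return Instruction.HOLD;               // keep waiting for more replicas
        }

        long largestOffset = reportedOffsets.values().stream().max(Long::compare).orElse(offset);
        if (offset < largestOffset) {
            return Instruction.CATCHUP;            // consume up to largestOffset, then poll again
        }
        if (chosenCommitter == null) {
            chosenCommitter = replica;             // a replica at the largest offset commits
        }
        return chosenCommitter.equals(replica) ? Instruction.COMMIT : Instruction.HOLD;
    }
}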
this means that Pinot needs to offer both high throughput for automated queries and low latency for interactive queries coming from users. Queries contain aggregations of metrics with a variable number of filtering predicates and grouping clauses, depending on the specific drill down requested by the user.

The data size for this scenario is 16 GB of data for Pinot and 13.8 GB of data for Druid. Figure 11 shows the latency of different indexing strategies as the query rate increases. On this dataset, the latency of Druid quickly becomes too high for interactive purposes. As expected, the performance of Pinot without indexes also drops off quickly. Adding inverted indices increases the scalability of Pinot by a factor of two for this dataset, but the largest gain in scalability comes from integrating the star tree index structure described in section 4.3.
Figure 14: Comparison of Druid and Pinot on the “share analytics” dataset (latency in milliseconds versus queries per second, at the 50th, 90th, 95th, and 99th latency percentiles)
from scaling such a system. We have also shown that a single system can process a wide range of commonly encountered analytical queries from a large web site. Finally, we have also compared how Druid, a system similar to Pinot, performs with production data and queries from a large professional network, as well as the impact of various indexing techniques and optimizations implemented in Pinot.

Future work includes adding additional types of indexes and specialized data structures for query optimization and observing their effects on query performance and service scalability.