Gorilla: A fast, scalable, in-memory time series database – Pelkonen et al. 2015
Error rates across one of Facebook’s sites were spiking. The problem had first shown up a few minutes after it started, via an automated alert triggered by an in-memory time series database called Gorilla. One set of engineers mitigated the immediate issue. A
second group set out to find the root cause. They fired up Facebook’s time series correlation
engine built on top of Gorilla, and searched for metrics showing a correlation with the errors.
This showed that copying a release binary to Facebook’s web servers (a routine event) caused
an anomalous drop in memory used across the site…
In the 18 months prior to publication, Gorilla helped Facebook engineers identify and debug
several such production issues.
As of Spring 2015, Facebook’s monitoring systems generated more than 2 billion unique time
series of counters, with about 12 million data points added per second – over 1 trillion data
points per day. Here then are the design goals for Gorilla: sustain that write rate, keep the most recent 26 hours of data in memory, serve reads with latencies low enough to support interactive tooling, and remain available in the face of host and network failures.
To meet the performance requirements, Gorilla is built as an in-memory TSDB that functions
as a write-through cache for monitoring data ultimately written to an HBase data store. To meet the requirement of storing 26 hours of data in memory, Gorilla incorporates a new time
series compression algorithm that achieves an average 12x reduction in size. The in-memory
data structures allow fast and efficient scans of all data while maintaining constant time
lookup of individual time series.
The key specified in the monitoring data is used to uniquely identify a time
series. By sharding all monitoring data based on these unique string keys, each
time series dataset can be mapped to a single Gorilla host. Thus, we can scale
Gorilla by simply adding new hosts and tuning the sharding function to map
new time series data to the expanded set of hosts. When Gorilla was launched
to production 18 months ago, our dataset of all time series data inserted in the
past 26 hours fit into 1.3TB of RAM evenly distributed across 20 machines. Since
then, we have had to double the size of the clusters twice due to data growth,
and are now running on 80 machines within each Gorilla cluster. This process
was simple due to the share-nothing architecture and focus on horizontal
scalability.
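A minimal sketch of the idea in C++ (illustrative only: the paper doesn’t spell out the actual sharding function, and a production system would want a mapping that doesn’t reshuffle every key when hosts are added):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Map a time series key (e.g. "web.servers.errors") to one Gorilla host.
// Purely illustrative: the real system manages shard-to-host assignment
// dynamically, but the core idea is that the string key alone determines
// which host owns the series.
std::size_t host_for_key(const std::string& key, std::size_t num_hosts) {
    return std::hash<std::string>{}(key) % num_hosts;
}
```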
The in-memory data structure is anchored in a C++ standard library unordered map. This
proved to have sufficient performance and no issues with lock contention. For persistence, Gorilla stores data in GlusterFS, a POSIX-compliant distributed file system with 3x replication.
“HDFS, or other distributed file systems would have sufficed just as easily.” For more details on
the data structures and how Gorilla handles failures, see sections 4.3 and 4.4 in the paper. I
want to focus here on the techniques Gorilla uses for time series compression to fit all of that
data into memory!
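Before moving on to compression, here’s a much-simplified sketch of what such a map-based shard index might look like. The class shape, the names, and the use of std::shared_mutex are my own choices; the real structures described in section 4.3 are richer:

```cpp
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the per-series structure holding the closed two-hour
// blocks plus the open block currently being appended to.
struct TimeSeries {};

// Simplified sketch of one shard's in-memory index: the unordered_map gives
// constant-time lookup by key, while the vector of shared_ptrs supports
// efficient scans over every series in the shard.
class TimeSeriesMap {
public:
    std::shared_ptr<TimeSeries> find(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = by_key_.find(key);
        return it == by_key_.end() ? nullptr : it->second;
    }

    void insert(const std::string& key, std::shared_ptr<TimeSeries> series) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        all_series_.push_back(series);
        by_key_.emplace(key, std::move(series));
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, std::shared_ptr<TimeSeries>> by_key_;
    std::vector<std::shared_ptr<TimeSeries>> all_series_;
};
```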
When it comes to time stamps, a key observation is that most sources log points at fixed
intervals (e.g. one point every 60 seconds). Every now and then the data point may be logged a
little bit early or late (e.g., a second or two), but this window is normally constrained. We’re
now entering a world where every bit counts, so if we can represent successive time stamps
with very small numbers, we’re winning… Each data block is used to store two hours of data.
The block header stores the starting time stamp, aligned to this two hour window. The first
time stamp in the block (first entry after the start of the two hour window) is then stored as a
delta from the block start time, using 14 bits. 14 bits is enough to span a bit more than 4 hours
at second resolution so we know we won’t need more than that.
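For concreteness, the alignment arithmetic looks something like this (the function names are mine):

```cpp
#include <cstdint>

// Blocks cover two hours of data; the header stores the block start time
// aligned to the two-hour window, and the first point's time stamp is kept
// as a delta from that start. The delta always fits comfortably in 14 bits,
// since 2^14 seconds is roughly four and a half hours.
constexpr int64_t kBlockSpanSeconds = 2 * 60 * 60;

int64_t aligned_block_start(int64_t unix_seconds) {
    return unix_seconds - (unix_seconds % kBlockSpanSeconds);
}

int64_t first_timestamp_delta(int64_t unix_seconds) {
    return unix_seconds - aligned_block_start(unix_seconds);  // 0 .. 7199
}
```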
For all subsequent time stamps, we compare deltas. Suppose we have a block start time of
02:00:00, and the first time stamp is 62 seconds later at 02:01:02. The next data point is at
02:02:02, another 60 seconds later. Comparing these two deltas, the second delta (60 seconds) is 2 seconds shorter than the first one (62 seconds), so we record -2. How many bits should we use to record that -2? As few as possible, ideally! We can use tag bits to tell us how many bits the actual value is encoded with. The scheme works as follows:

- Compute the delta of deltas, D (the difference between the current delta and the previous one).
- If D is zero, store a single ‘0’ bit.
- If D is in [-63, 64], store ‘10’ followed by the value in 7 bits.
- If D is in [-255, 256], store ‘110’ followed by the value in 9 bits.
- If D is in [-2047, 2048], store ‘1110’ followed by the value in 12 bits.
- Otherwise, store ‘1111’ followed by D in 32 bits.
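Here’s a rough sketch of that encoder. The BitWriter is a stand-in for a real bit-packing stream, and exactly how the signed value is squeezed into 7/9/12 bits is an implementation detail the paper leaves open; I’ve used a simple offset:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal bit-stream writer, just for this sketch.
class BitWriter {
public:
    void write(uint64_t value, int num_bits) {   // append the low num_bits of value
        for (int i = num_bits - 1; i >= 0; --i)
            bits_.push_back((value >> i) & 1u);
    }
    std::size_t bit_count() const { return bits_.size(); }
private:
    std::vector<bool> bits_;
};

// Append one time stamp using the delta-of-deltas scheme listed above.
// `prev` and `prev_delta` carry state from the previous point in the block;
// the very first time stamp is handled separately (a 14-bit delta from the
// block start), so this is only called from the second point onwards.
void append_timestamp(BitWriter& out, int64_t ts,
                      int64_t& prev, int64_t& prev_delta) {
    const int64_t delta = ts - prev;
    const int64_t dod = delta - prev_delta;        // delta of deltas

    if (dod == 0) {
        out.write(0b0, 1);                         // '0' (the ~96% case)
    } else if (dod >= -63 && dod <= 64) {
        out.write(0b10, 2);
        out.write(static_cast<uint64_t>(dod + 63), 7);
    } else if (dod >= -255 && dod <= 256) {
        out.write(0b110, 3);
        out.write(static_cast<uint64_t>(dod + 255), 9);
    } else if (dod >= -2047 && dod <= 2048) {
        out.write(0b1110, 4);
        out.write(static_cast<uint64_t>(dod + 2047), 12);
    } else {
        out.write(0b1111, 4);
        out.write(static_cast<uint64_t>(dod), 32); // fall back to a full 32 bits
    }
    prev = ts;
    prev_delta = delta;
}
```

For the example above, the delta of deltas of -2 lands in the [-63, 64] bucket, so it costs 2 tag bits plus 7 value bits: 9 bits instead of a full time stamp.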
Figure 3 shows the results of time stamp compression in Gorilla. We have found
that about 96% of all time stamps can be compressed to a single bit.
(i.e., 96% of all time stamps occur at regular intervals, such that the delta of deltas is zero).
So much for time stamps; what about the data values themselves?
We discovered that the value in most time series does not change significantly
when compared to its neighboring data points. Further, many data sources only
store integers. This allowed us to tune the expensive prediction scheme in [25]
to a simpler implementation that merely compares the current value to the
previous value. If values are close together the sign, exponent, and first few bits
of the mantissa will be identical. We leverage this to compute a simple XOR of
the current and previous values rather than employing a delta encoding
scheme.
The values are then encoded as follows:

1. The first value in a block is stored with no compression.
2. If the XOR with the previous value is zero (i.e., the value is identical), store a single ‘0’ bit.
3. If the XOR is non-zero, calculate the number of leading and trailing zeros in the XOR’d value and store a ‘1’ bit, followed by one of two cases:
(a) Control bit ‘0’: if the meaningful bits fit within the meaningful bit range of the previous value (at least as many leading zeros and at least as many trailing zeros as before), reuse that block position and store just the meaningful bits of the XOR’d value.
(b) Control bit ‘1’: if the meaningful bits do not fit within the meaningful bit range of the previous value, store the number of leading zeros in the next 5 bits, the length of the meaningful XOR’d value in the next 6 bits, and finally the meaningful bits of the XOR’d value.
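A sketch of that decision logic, with the actual bit emission omitted. The function and struct names are mine, and I’m leaning on C++20’s <bit> helpers for the leading/trailing zero counts:

```cpp
#include <bit>        // std::countl_zero / std::countr_zero (C++20)
#include <cstdint>
#include <cstring>

// View a double's raw bits so it can be XORed with its predecessor.
uint64_t bits_of(double v) {
    uint64_t b;
    std::memcpy(&b, &v, sizeof b);
    return b;
}

struct ValueEncoding {
    int control_bits;   // 0 -> single '0' bit, 2 -> '10' case, 3 -> '11' case
    int total_bits;     // total size of this value's encoding
};

// Decide which of the cases above applies to `value`, given the previous
// value and the leading/trailing zero counts used for the previous encoding.
ValueEncoding plan_value(double value, double previous,
                         int prev_leading, int prev_trailing) {
    const uint64_t x = bits_of(value) ^ bits_of(previous);
    if (x == 0)
        return {0, 1};                              // identical value: one '0' bit

    const int leading    = std::countl_zero(x);
    const int trailing   = std::countr_zero(x);
    const int meaningful = 64 - leading - trailing;

    if (leading >= prev_leading && trailing >= prev_trailing) {
        // '10': the meaningful bits fit inside the previous block position,
        // so only the (previous-width) meaningful bits are stored.
        const int prev_meaningful = 64 - prev_leading - prev_trailing;
        return {2, 2 + prev_meaningful};
    }
    // '11': 5 bits for the leading-zero count, 6 bits for the meaningful
    // length, then the meaningful bits themselves.
    return {3, 2 + 5 + 6 + meaningful};
}
```

XORing the raw IEEE 754 bits is what makes the observation about neighboring values pay off: when only the low-order mantissa bits differ, the XOR’d value has long runs of leading and trailing zeros, and those runs are cheap to encode.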
Roughly 51% of all values are compressed to a single bit since the current and
previous values are identical. About 30% of the values are compressed with the
control bits ‘10’ with an average compressed size of 26.6 bits. The remaining 19%
are compressed with control bits ‘11’, with an average size of 36.9 bits, due to the
extra overhead required to encode the length of leading zero bits and
meaningful bits.
Gorilla’s low latency processing (over 70x faster than the previous system it replaced) enabled the Facebook team to build a number of tools on top. These include horizon charts; aggregated roll-ups, which update based on all completed buckets every two hours; and the correlation engine we saw being used in the opening case study.

Among the lessons learned that the authors call out, three stand out:
1. Prioritize recent data over historical data. Why things are broken right now is a more
pressing question than why they were broken 2 days ago.
2. Read latency matters – without fast reads, the more advanced tools built on top would not have been practical.
3. High availability trumps resource efficiency.
We found that building a reliable, fault tolerant system was the most time
consuming part of the project. While the team prototyped a high performance,
compressed, in-memory TSDB in a very short period of time, it took several
more months of hard work to make it fault tolerant. However, the advantages of
fault tolerance were visible when the system successfully survived both real
and simulated failures.