StarRocks Intro
Uploaded by George Zhu

StarRocks

Real-time Analytics Made Easy


Company Profile

StarRocks—A Brief History

• Founded: May 2020, HQ in Silicon Valley
• Source code: on GitHub; a Doris spin-off with 80% new code
• Growth: already used by over 500 companies
Challenges Facing
Real-time Analytics

Why is real-time analytics at scale so hard?

Challenges Facing Today's Real-time Analytics

The bad case of the de-normalized table

[Diagram: a star schema, with a fact table joined to dimension tables D1–D4 (and D1.1), converted into a single flat table]

• Adds complexity to the data pipeline
• Adds delay to data ingestion
• Extra hardware, development, and maintenance cost
• Does NOT accommodate business changes easily

The conventional advice is "denormalize if possible": convert the "star schema" into a denormalized "flat schema".
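To make the trade-off concrete, here is a minimal, hypothetical sketch in plain Python (the `fact` rows and `dim_customer` table are made up for illustration) of what converting a star schema into a flat table means: every dimension attribute is copied into every fact row, which is why a single dimension change forces the wide table to be rebuilt.

```python
# Hypothetical star schema: a fact table plus one dimension table.
fact = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": 40.0},
]
dim_customer = {
    10: {"name": "Alice", "region": "US"},
    11: {"name": "Bob", "region": "EU"},
}

def denormalize(fact_rows, dim):
    """Join dimension attributes into every fact row (a 'flat' table)."""
    return [{**row, **dim[row["customer_id"]]} for row in fact_rows]

flat = denormalize(fact, dim_customer)
# Each fact row now carries a copy of the dimension attributes, so a
# change to one customer's region means rewriting all of that customer's rows.
print(flat[0])
# {'order_id': 1, 'customer_id': 10, 'amount': 25.0, 'name': 'Alice', 'region': 'US'}
```

The copies are what make flat tables fast to scan but expensive to keep current, which is the pain the bullets above describe.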
Challenges Facing Today's Real-time Analytics

Real-time analytics engines do
NOT handle updates well

• Updates are forced into the DB asynchronously. In ClickHouse, update/delete is implemented as "ALTER TABLE ... UPDATE" (mutations):
  https://clickhouse.Yandex/docs/en/query_language/alter/#alter-mutations
• Storage engines use either Merge-on-Read or segment replacement
• Query performance struggles when processing updates/deletes
• Many use cases CANNOT be supported
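As an illustration of why reads slow down, here is a simplified, hypothetical Merge-on-Read sketch in Python (the general technique, not ClickHouse's or StarRocks' actual implementation): writes only append new versions, so updates and deletes are cheap, but every read must merge the version history to find the latest value.

```python
# Simplified Merge-on-Read sketch: each write appends a (version, value)
# pair; reads must merge the versions to find the newest one per key.
from collections import defaultdict

class MergeOnReadTable:
    def __init__(self):
        self._log = defaultdict(list)  # key -> list of (version, value)
        self._version = 0

    def upsert(self, key, value):
        # Writes are cheap: just append, no in-place rewrite.
        self._version += 1
        self._log[key].append((self._version, value))

    def delete(self, key):
        # Deletes are tombstone records, not physical removal.
        self.upsert(key, None)

    def read(self, key):
        # Reads pay the merge cost: scan all versions, keep the newest.
        versions = self._log.get(key)
        if not versions:
            return None
        _, value = max(versions)  # highest version number wins
        return value

t = MergeOnReadTable()
t.upsert("user1", {"clicks": 1})
t.upsert("user1", {"clicks": 2})   # update appends a new version
t.delete("user1")                  # delete appends a tombstone
print(t.read("user1"))             # None: the tombstone is the latest version
```

The more frequently rows are updated, the longer the version lists grow and the more work each query does, which is the performance struggle the slide points at.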
Challenges Facing Today's Real-time Analytics

High concurrency or
real-time? Pick one!

• ClickHouse's CPU-intensive architecture does NOT support high concurrency well
• Defaults to ONLY 100 concurrent queries
• Not suitable for a large user base or external-facing applications
Challenges Facing Today's Real-time Analytics

Extremely difficult to maintain

[Diagram: resharding from 3 shards (placement by shard_key % 3) to 5 shards (shard_key % 5); each shard serves a distributed table over local tables, and all data must be rebalanced]

• Difficult to scale out: heavy data re-balancing
• Relies on many 3rd-party components
• Complex data pipeline
• Increased Total Cost of Ownership

"The issue we faced was that ClickHouse doesn't automatically rebalance data in the cluster when we add new shards."
(on ClickHouse scale-out issues)
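The rebalancing cost in the diagram can be quantified with a small sketch (plain Python; the key count is chosen for illustration): with modulo placement, going from shard_key % 3 to shard_key % 5 changes the home shard of most keys, so most rows have to move.

```python
# Sketch of modulo-based shard placement, as in the
# shard_key % 3 -> shard_key % 5 resharding diagram above.

def shard_for(key: int, num_shards: int) -> int:
    # The shard that owns this key under modulo placement.
    return key % num_shards

keys = range(10_000)  # pretend these are shard keys of existing rows
moved = sum(1 for k in keys if shard_for(k, 3) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of rows must move when going from 3 to 5 shards")
# prints: 80% of rows must move when going from 3 to 5 shards
```

Roughly 80% of rows relocate under this scheme; that is the "heavy data re-balancing" the slide calls out, and why it has to be done manually in ClickHouse.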
StarRocks, Real-time
Analytics Made Easy
StarRocks Key Capabilities

01 Blazing Fast Queries
• OLAP or ad-hoc analytics
• Sub-second query latency
• Flat table or multi-table joins
• Query billions of rows

02 Real-time Insight
• Second-level data freshness
• High-speed data ingestion
• Real-time update and delete

03 Analytics for Everyone
• Supports 1000s of concurrent users
• Up to 10,000 QPS (Queries Per Second)

04 Simple Operations
• Reduced data pipeline complexity
• Linear scalability
• Reduced TCO
StarRocks—Real-time Queries Made Easy

2x to 6x faster in standard benchmark testing

• Blazing fast queries on star schemas and flat tables
• De-normalized tables are NOT required
• Greatly simplifies the data pipeline
• Opens doors to more use cases
StarRocks—Real-time Processing Made Easy

[Diagram: with Druid / Pinot / ClickHouse and others, Kafka feeds an online ETL job that builds a single wide data table (with batch replace from a data lake) before serving the application; with StarRocks, streaming or Change Data Capture ingests online data from Kafka directly, and the application queries StarRocks]

• Blazing fast queries with frequent data updates
• Complete update/delete functions
• Sub-second query latency even when data is frequently updated
StarRocks—Real-time Operations Made Easy

High concurrency and high throughput

• Supports 10,000s of concurrent users
• Resource isolation based on queries
• Linear scalability for better concurrency
• Brings the power of data-driven analytics to everyone!
StarRocks—Real-time Operations Made Easy

Simple and Elegant Architecture

[Diagram: Client Application → MySQL Protocol → FE layer (FE–Leader, FE–Leader, FE–Observer, each with a Catalog Manager and Query Optimizer) → BE layer (BE nodes, each with an Execution Engine and Storage Engine)]

• No dependencies on external components
• Auto scaling without human intervention
• Linear, predictable scaling model
• Reduced operational costs
Summary: StarRocks Makes Real-time Analytics Easy

Other Products                                  | StarRocks
De-normalized tables are a necessary evil       | Superior query performance without de-normalization
Struggle with updates and deletes               | Maintains performance while data is frequently updated/deleted
Low concurrency with only 10–100 users          | High concurrency with 10,000s of users
Complex architecture, 3rd-party dependencies,   | Simplified architecture, easy to scale,
hard to maintain and scale                      | and reduced TCO
World-class Engineering Features

• Cost-Based Optimizer: the cornerstone of distributed joins in query execution
• Fully vectorized query engine: the only query engine with vectorized execution across the CPU, memory, and storage layers
• Pipeline execution: fully leverages CPU cores for parallel processing
• Intelligent materialized views: transparent query acceleration
• Resource management: no single runaway query can bring down the cluster
• 100% SQL compatible with the MySQL client protocol: out-of-the-box support for all major BI tools
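As a rough, generic illustration of what "vectorized execution" means (a sketch of the general technique, not StarRocks internals): the engine processes whole columns in batches rather than one row at a time, which amortizes per-row overhead and, in a real engine, enables SIMD.

```python
# Generic illustration of row-at-a-time vs. vectorized (columnar, batch)
# execution -- not StarRocks internals, just the general idea.

rows = [{"price": p, "qty": q} for p, q in [(2.0, 3), (5.0, 1), (1.5, 4)]]

def revenue_row_at_a_time(rows):
    # Row-at-a-time: per-row dict lookups and dispatch overhead.
    total = 0.0
    for r in rows:
        total += r["price"] * r["qty"]
    return total

def revenue_vectorized(price_col, qty_col):
    # Vectorized: operate on whole columns per batch; a real engine
    # would run this as tight SIMD loops over contiguous arrays.
    return sum(p * q for p, q in zip(price_col, qty_col))

# Columnar layout: one array per column instead of one dict per row.
price_col = [r["price"] for r in rows]
qty_col = [r["qty"] for r in rows]
assert revenue_row_at_a_time(rows) == revenue_vectorized(price_col, qty_col) == 17.0
```

Both paths compute the same answer; the columnar one is the shape that lets a native engine keep CPU caches and vector units busy.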
Announcing StarRocks Cloud
(Embargoed until 8am EDT, July 14th, 2022)

• Cloud-native deployment of StarRocks real-time analytics
• Automated elastic cloud resource management
• Separation of storage and compute in the cloud
• Reduced system administration effort
• Lower and more transparent infrastructure cost
• Initially on AWS, with Azure and GCP available soon


Case Studies

Use Case: Minerva, the Metrics Store at Airbnb

[Diagram: DB exports, logging, and 3rd-party data feed dimension and fact tables, which are joined into denormalized wide tables (row data) for the Minerva query layer. Denormalization compute cost is very high.]

Background and Pain Points
• Minerva is Airbnb's internal unified metrics platform
• Vision: "Define metrics once, use them everywhere"
• Over 12,000 metrics and 4,000 dimensions
• Used for various data consumption scenarios, such as A/B testing, data exploration, and data analysis
• In Minerva v1, multiple flat tables feed into Druid
• Any change at the source requires refreshing the wide tables, which can take hours
Use Case: Minerva, the Enterprise Metrics Store at Airbnb

Minerva on Demand, Powered by StarRocks

[Diagram: the same sources feed Minerva on StarRocks, but only some of the fact and dimension tables are denormalized into wide tables first]

• No need to pre-aggregate the way Druid does
• Handles high-cardinality dimensions much better
• Less than 20% of the data needs to be de-normalized
• The rest is queried in star schema on the fly
• Improved data freshness and reduced TCO
Case: Real-time Analytics at a Social Media App
with 200MM+ Active Users

2017 → 2018 → 2019 → 2020 → 2021
From batch to real-time analytics: Redshift ➾ Hive/Presto ➾ ClickHouse ➾ StarRocks

• Built a real-time advertisement data platform on ClickHouse in 2019
• Had stability, concurrency, and update issues as data volume and the number of users grew
• StarRocks replaced ClickHouse in 2021 as the new advertisement data platform
Thank You
