
Apache Kylin - Extreme OLAP Engine For Hadoop Presentation

Apache Kylin is an open-source distributed analytics engine that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets. It allows for interactive queries on datasets as large as tens of terabytes by pre-aggregating data into a cube structure and storing the results in HBase for low latency queries. Kylin uses MapReduce jobs to build cubes incrementally from datasets in Hive and provides a web GUI and SQL interface for managing, building, and querying cubes.


Apache Kylin

Extreme OLAP Engine for Big Data
Luke Han, Yang Li
2015-05-06
http://kylin.io | @ApacheKylin
About us
• Luke Han | [email protected] | @lukehq
- Apache Kylin PMC member & Product Owner
- Sr. Product Manager of eBay GDI
- From Shanghai, China

• Yang Li | [email protected]
- Apache Kylin PMC member & Tech Leader
- Sr. Architect of eBay GDI
- From Shanghai, China
Agenda
• About Apache Kylin
• Feature Highlights
• Tech Highlights
• Roadmap
• Q&A
What
kylin / ˈkiːˈlɪn / 麒麟 @ApacheKylin
--n. (in Chinese art) a mythical animal of composite form

Extreme OLAP Engine for Big Data


Kylin is an open source Distributed Analytics Engine, contributed
by eBay Inc., that provides a SQL interface and multi-dimensional analysis
(OLAP) on Hadoop, supporting extremely large datasets

• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014
• http://kylin.io (http://kylin.incubator.apache.org)
Why
[Figure: chart labeled Happiness, size, and Latency, with a 10s mark]
Balance Between Space and Time
OLAP Cube
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)

0-D (apex) cuboid
1-D cuboids: time | item | location | supplier
2-D cuboids: time, item | time, location | time, supplier | item, location | item, supplier | location, supplier
3-D cuboids: time, item, location | time, item, supplier | time, location, supplier | item, location, supplier
4-D (base) cuboid: time, item, location, supplier
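
The cuboid lattice for the four example dimensions can be enumerated with a short sketch. This is illustrative Python only, not Kylin code; the function name `cuboids_by_level` is made up for this example.

```python
from itertools import combinations

# The four example dimensions from the lattice.
DIMENSIONS = ("time", "item", "location", "supplier")

def cuboids_by_level(dimensions):
    """Group every dimension combination (cuboid) by its number of dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboids_by_level(DIMENSIONS)
for level, cuboids in lattice.items():
    print(f"{level}-D: {len(cuboids)} cuboid(s)")

# The cube is the set of all cuboids: 2^4 of them for 4 dimensions.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

Level 0 holds only the apex cuboid (the empty combination) and level 4 holds only the base cuboid.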

• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
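
The relationships between the five cells above can be checked mechanically. The sketch below assumes the usual textbook definitions: a cell is an ancestor of another if it agrees with it wherever it is not `*`, and it is a parent if it additionally has exactly one more `*`. The helper names are hypothetical.

```python
STAR = "*"

def is_ancestor(a, b):
    """True if cell `a` is a strict ancestor of cell `b`.

    `a` must differ from `b` and match it in every position where `a` is not '*'.
    """
    if a == b:
        return False
    return all(x == STAR or x == y for x, y in zip(a, b))

def is_parent(a, b):
    """True if `a` is an ancestor of `b` exactly one aggregation level up."""
    return is_ancestor(a, b) and a.count(STAR) == b.count(STAR) + 1

# The five example cells, keyed by their number in the list.
cells = {
    1: ("9/15", "milk", "Urbana", "Dairy_land"),
    2: ("9/15", "milk", "Urbana", "*"),
    3: ("*", "milk", "Urbana", "*"),
    4: ("*", "milk", "Chicago", "*"),
    5: ("*", "milk", "*", "*"),
}

print(is_parent(cells[2], cells[1]))    # cell 2 is a parent of base cell 1
print(is_ancestor(cells[3], cells[1]))  # cell 3 is an ancestor, but not a parent
print(is_ancestor(cells[4], cells[1]))  # cell 4 covers Chicago, not Urbana
```

Cells 3 and 4 are siblings: neither is an ancestor of the other, and both are children of cell 5.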
How
[Figure: stack, top to bottom: BI Tools, Web App… | ANSI SQL | Kylin | MapReduce]
Feature Highlights
• Extremely Fast OLAP Engine at scale
• ANSI SQL Interface on Hadoop
• Seamless Integration with BI Tools, like Tableau
• Interactive Query Capability
• MOLAP Cube
• Incremental Build of Cubes
• Approximate Query Capability for Distinct Count (HyperLogLog)
• Leverages HBase Coprocessor to reduce query latency
• Job Management and Monitoring
• User-friendly Web GUI to manage, build, monitor and query cubes
• Security capability to set ACL at Cube/Project Level
• Support LDAP Integration
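
To make the approximate distinct count idea concrete, here is a minimal HyperLogLog sketch. It is a generic textbook version, not Kylin's implementation; the class name, the SHA-1 based hash, and the precision `p=12` are assumptions chosen for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog counter (illustrative, not Kylin's code)."""

    def __init__(self, p=12):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the estimate
print(round(hll.count()))
```

The counter uses a fixed number of registers regardless of how many values are added, which is why a distinct-count measure can be stored compactly and merged across cube cells (by taking the register-wise maximum).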
Define Data Model
Manage Jobs
Explore the Data
Interact with BI Tool - Tableau
Who is using Kylin?
• eBay
- 90% of queries < 5 seconds

Case                     Cube Size    Raw Records
User Session Analysis    26 TB        28+ billion rows
Traffic Analysis         21 TB        20+ billion rows
Behavior Analysis        560 GB       1.2+ billion rows

• Baidu
- Baidu Map internal analysis

• Many other Proofs of Concept
- Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, StubHub, Tableau … (from mailing list)
Kylin Architecture Overview
[Figure: architecture diagram. Legend: Online Analysis Data Flow, Offline Data Flow]
• Clients: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC
• Clients/users interact with Kylin via SQL
• OLAP Cube is transparent to users
• Kylin: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…)
• Hadoop: Hive (Star Schema Data; Mid Latency - Minutes) and OLAP Cube in HBase (Key Value Data; Low Latency - Seconds)
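
As an illustration of the REST path, the sketch below builds (but does not send) a SQL query request. It assumes the `POST /kylin/api/query` endpoint with basic authentication and a JSON body carrying `sql` and `project`; check the field names against the REST documentation of the Kylin version in use. The host, port, project name, and credentials are hypothetical placeholders.

```python
import base64
import json
from urllib.request import Request

def build_query_request(base_url, project, sql, user, password):
    """Build an HTTP request for a Kylin SQL query (the request is not sent here)."""
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": 100,
        "acceptPartial": False,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        url=f"{base_url}/kylin/api/query",   # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

# Hypothetical server, project and credentials.
req = build_query_request(
    "http://localhost:7070",
    "my_project",
    "SELECT COUNT(*) FROM test_kylin_fact",
    "user",
    "secret",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running server; the client only submits SQL and never references a cube directly.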
Data Modeling
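[Figure: mapping from a star schema to HBase storage through cube metadata]
• Roles: End User, Cube Modeler, Admin
• Source: Star Schema (a Fact table surrounded by Dim tables)
• Mapping: Cube Metadata (Cube: …, Fact Table: …, Dimensions: …, Measures: …, Storage(HBase): …)
• Target: HBase Storage (Row Key and Column Family/Column; row A = Val 1, row B = Val 2, row C = Val 3)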
Cube Build Job Flow
How to Store Cube - HBase Schema
Kylin Query Engine - Explain Plan
SQL:

SELECT test_cal_dt.week_beg_dt,
  test_category.category_name, test_category.lvl2_name,
  test_category.lvl3_name, test_kylin_fact.lstg_format_name,
  test_sites.site_name, SUM(test_kylin_fact.price) AS GMV,
  COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
  AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456
  OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt,
  test_category.category_name, test_category.lvl2_name,
  test_category.lvl3_name,
  test_kylin_fact.lstg_format_name, test_sites.site_name

Explain plan:

OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
              OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
            OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
Cube Optimization
• Curse of Dimensionality
- An N-dimension cube has 2^N cuboids
- Full Cube vs. Partial Cube

• Huge Data Volume
- Dictionary Encoding
- Incremental Building
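
Dictionary encoding replaces each distinct dimension value with a small integer ID, which shrinks keys when data volume is huge. The sketch below is a generic illustration with made-up helper names; it is not Kylin's dictionary implementation.

```python
def build_dictionary(values):
    """Map each distinct value to an integer ID, assigned in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}

def encode(values, dictionary):
    """Replace each value by its ID."""
    return [dictionary[v] for v in values]

def decode(ids, dictionary):
    """Map IDs back to the original values."""
    reverse = {idx: value for value, idx in dictionary.items()}
    return [reverse[i] for i in ids]

column = ["Urbana", "Chicago", "Urbana", "Urbana", "Chicago"]
dictionary = build_dictionary(column)
encoded = encode(column, dictionary)
print(dictionary)
print(encoded)
```

Assigning IDs in sorted order keeps the ordering of the original values, so range comparisons can be evaluated on the IDs directly.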
Full Cube vs. Partial Cube
• Full Cube
- Pre-aggregate all dimension combinations
- "Curse of dimensionality": an N-dimension cube has 2^N cuboids
• Partial Cube
- To avoid dimension explosion, we divide the dimensions into different aggregation groups
- 2^(N+M+L) → 2^N + 2^M + 2^L
- For a cube with 30 dimensions, if we divide the dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
- 2^30 → 2^10 + 2^10 + 2^10
- Tradeoff between online aggregation and offline pre-aggregation
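
The arithmetic behind the 30-dimension example can be checked directly. This sketch uses the slide's simplified count, where each aggregation group contributes its own 2^k cuboids independently; the function names are made up for illustration.

```python
def full_cube_cuboids(num_dimensions):
    """Number of cuboids when every dimension combination is pre-aggregated."""
    return 2 ** num_dimensions

def partial_cube_cuboids(group_sizes):
    """Number of cuboids when dimensions are split into aggregation groups."""
    return sum(2 ** size for size in group_sizes)

full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])
print(f"full cube:    {full:,} cuboids")
print(f"partial cube: {partial:,} cuboids")
print(f"reduction:    {full // partial:,}x")
```

Queries that combine dimensions from different groups are no longer answered by a single pre-built cuboid and need more aggregation at query time, which is the tradeoff named in the last bullet.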
Partial Cube
Incremental Build
What’s Next
• Improve cube algorithm
- Cube by segments, 30%-50% faster
- Build delay down to tens of minutes

• Streaming cubing
- Analyze real-time data
- Build delay down to seconds

• Spark
Cube by Layer
• The current algorithm
- Many MRs, the number of dimensions
- Huge shuffles, aggregation at reduce side, 100x of total cube size

[Figure: Full Data → MR → 4-D Cuboid → MR → 3-D Cuboid → MR → 2-D Cuboid → MR → 1-D Cuboid → MR → 0-D Cuboid]
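
The layer-by-layer idea can be sketched in plain Python, with one loop iteration standing in for one MR round. This is a single-process illustration under simplifying assumptions (a single SUM measure, in-memory data), not the actual MapReduce jobs; all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def aggregate(records, keep_idx):
    """Sum the measure of (key, measure) records, grouped by the kept key positions."""
    out = defaultdict(int)
    for key, measure in records:
        out[tuple(key[i] for i in keep_idx)] += measure
    return dict(out)

def build_by_layer(rows, dims):
    """Build every cuboid, deriving each layer from the layer above it."""
    n = len(dims)
    cube = {dims: aggregate(rows, range(n))}   # base cuboid from full data
    for k in range(n - 1, -1, -1):             # one "MR round" per layer
        for child in combinations(dims, k):
            # Any cuboid one layer up that contains the child's dimensions will do.
            parent = next(p for p in cube if len(p) == k + 1 and set(child) <= set(p))
            idx = [parent.index(d) for d in child]
            cube[child] = aggregate(cube[parent].items(), idx)
    return cube

dims = ("time", "item", "location")
rows = [
    (("9/15", "milk", "Urbana"), 3),
    (("9/15", "milk", "Chicago"), 2),
    (("9/16", "bread", "Urbana"), 5),
]
cube = build_by_layer(rows, dims)
print(len(cube), "cuboids")
print(cube[("item",)])
print(cube[()])
```

Each layer has to wait for the layer above it, which is why the number of rounds grows with the number of dimensions.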
Cube by Segments

• The to-be algorithm, 30%-50% faster
- 1 round MR
- Reduced shuffles, map side aggregation, 20x total cube size
- Hourly incremental build done in tens of minutes

[Figure: Data Split → mapper → Cube Segment (repeated per split) → Merge Sort (Shuffle) → Final Cube]
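
A rough single-process sketch of the by-segments idea: each mapper pre-aggregates a complete cube segment for its own data split, and a single merge step combines the segments. It assumes a single SUM measure and in-memory data, and the function names are hypothetical; it is not the actual Kylin job.

```python
from collections import Counter
from itertools import combinations

def cube_segment(split, dims):
    """Mapper: aggregate every cuboid for one data split (map-side aggregation)."""
    segment = Counter()
    n = len(dims)
    for key, measure in split:
        for k in range(n + 1):
            for idx in combinations(range(n), k):
                cuboid = tuple(dims[i] for i in idx)
                cell = tuple(key[i] for i in idx)
                segment[(cuboid, cell)] += measure
    return segment

def merge_segments(segments):
    """Reducer: merge cube segments by summing matching cells."""
    final = Counter()
    for segment in segments:
        final.update(segment)   # Counter.update adds counts
    return final

dims = ("time", "item", "location")
splits = [
    [(("9/15", "milk", "Urbana"), 3), (("9/15", "milk", "Chicago"), 2)],
    [(("9/16", "bread", "Urbana"), 5), (("9/15", "milk", "Urbana"), 1)],
]
final_cube = merge_segments(cube_segment(split, dims) for split in splits)
print(final_cube[(("item",), ("milk",))])
print(final_cube[((), ())])
```

Because rows that share a cell are combined inside the mapper, only already-aggregated cells cross the shuffle, which is where the reduced shuffle volume comes from.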
Streaming Cubing
• Cube is great but…
- Cube takes time to build, how about real-time analysis?
- Sometimes we want to drill down to row level information

• Streaming cubing
- Build micro cube segments from streaming
- Use inverted index to capture last minute data
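
An inverted index keeps the raw rows and, for each (column, value) pair, the set of row IDs that contain it, so recent data stays queryable at row level without a cube build. The class below is a minimal illustration with hypothetical names, not Kylin's streaming storage.

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal in-memory inverted index over dictionary-like rows."""

    def __init__(self):
        self.rows = []
        self.postings = defaultdict(set)   # (column, value) -> set of row ids

    def add(self, row):
        """Append a row and index each of its column values."""
        row_id = len(self.rows)
        self.rows.append(row)
        for column, value in row.items():
            self.postings[(column, value)].add(row_id)

    def lookup(self, **filters):
        """Return rows matching all column=value filters, in arrival order."""
        ids = None
        for column, value in filters.items():
            matched = self.postings.get((column, value), set())
            ids = matched if ids is None else ids & matched
        if ids is None:
            ids = set(range(len(self.rows)))
        return [self.rows[i] for i in sorted(ids)]

index = InvertedIndex()
index.add({"item": "milk", "location": "Urbana", "price": 3})
index.add({"item": "milk", "location": "Chicago", "price": 2})
index.add({"item": "bread", "location": "Urbana", "price": 5})

hits = index.lookup(item="milk", location="Urbana")
print(hits)
```

New rows become searchable as soon as `add` returns, and lookups return full rows, which covers the drill-down case a pre-aggregated cube cannot serve.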
Kylin Lambda Architecture

[Figure: ANSI SQL → Query Engine → Hybrid Storage Interface, which reads from two stores fed by Streaming: an Inverted Index (Last Hour; seconds delay) and a Cube (Before Last Hour; minutes delay)]
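
The hybrid storage idea can be illustrated as a query that adds a pre-aggregated total from the cube to a scan over recent rows. The sketch assumes the two stores cover disjoint time ranges (cube: before the last hour; recent rows: the last hour); the function and variable names are hypothetical.

```python
def hybrid_sum(cube_totals, recent_rows, item):
    """Total price for `item`: pre-aggregated history plus recent raw rows.

    cube_totals: {item: total} covering data before the last hour.
    recent_rows: [(item, price)] covering only the last hour.
    """
    historical = cube_totals.get(item, 0)
    realtime = sum(price for row_item, price in recent_rows if row_item == item)
    return historical + realtime

cube_totals = {"milk": 120.0, "bread": 40.0}
recent_rows = [("milk", 3.5), ("bread", 2.0), ("milk", 1.5)]

print(hybrid_sum(cube_totals, recent_rows, "milk"))
```

The caller sees one answer; which store supplied which part of it is hidden behind the storage interface.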


Adding Spark Support
• Cubing Efficiency
- MR is not the optimal framework
- Spark Cubing Engine
• Source from SparkSQL
- Read data from SparkSQL instead of Hive
• Route to SparkSQL
- Unsupported queries are covered by SparkSQL
Kylin Evolution Roadmap
[Figure: roadmap timeline with axis 2013, 2014, 2015, 2016, Future]
• Date markers: Sep, 2013 | Jan, 2014 | Oct, 2014 | H1, 2015 | TBD
• Stages: Initial | MOLAP | HybridOLAP | StreamingOLAP | Next Gen
• Items: Prototype for MOLAP | Basic end to end POC | Incremental Refresh | ANSI SQL | ODBC Driver | Web GUI | Tableau | ACL | Open Source | Lambda Arch | Streaming OLAP | JDBC Driver | New UI | Excel | Automation | Capacity Management | Spark | SparkSQL | Adv OLAP Functions | In-Memory Analysis (TBD) | Mobile (TBD) | … more
Kylin Ecosystem
• Kylin Core
- Fundamental framework of the Kylin OLAP Engine
• Extension
- Plugins to support additional functions and features
- Security, Redis Storage, Spark Engine, Docker
• Integration
- Lifecycle Management Support to integrate with other applications
- ODBC Driver, ETL, Drill, SparkSQL
• Interface
- Allows third-party users to build more features via user interface atop Kylin core
- Web Console, Customized BI, Ambari/Hue Plugin
If you want to go fast, go alone.
If you want to go far, go together.
--African Proverb

[email protected]
http://kylin.io
