Apache Kylin - Extreme OLAP Engine For Hadoop Presentation

Apache Kylin is an open-source distributed analytics engine that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets. It allows for interactive queries on datasets as large as tens of terabytes by pre-aggregating data into a cube structure and storing the results in HBase for low latency queries. Kylin uses MapReduce jobs to build cubes incrementally from datasets in Hive and provides a web GUI and SQL interface for managing, building, and querying cubes.

Apache Kylin

Extreme OLAP Engine

for Big Data
Luke Han, Yang Li
2015-05-06
http://kylin.io | @ApacheKylin
About us
• Luke Han | [email protected] | @lukehq
- Apache Kylin PMC member & Product Owner
- Sr. Product Manager of eBay GDI
- from Shanghai, China

• Yang Li | [email protected]
- Apache Kylin PMC member & Tech Leader
- Sr. Architect of eBay GDI
- from Shanghai, China
Agenda
• About Apache Kylin
• Feature Highlights
• Tech Highlights
• Roadmap
• Q&A
What
kylin / ˈkiːˈlɪn / 麒麟
-- n. (in Chinese art) a mythical animal of composite form

Extreme OLAP Engine for Big Data

Kylin is an open source distributed analytics engine, contributed
by eBay Inc., that provides a SQL interface and multi-dimensional analysis
(OLAP) on Hadoop, supporting extremely large datasets

• Open sourced on Oct 1st, 2014

• Accepted as an Apache Incubator project on Nov 25th, 2014
• http://kylin.io (http://kylin.incubator.apache.org)
Why
[Figure: chart relating Happiness to data size and Latency, with 10s marked]
Balance Between Space and Time
OLAP Cube
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)

[Figure: cuboid lattice for the dimensions time, item, location, supplier]
- 0-D (apex) cuboid
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
- 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
- 4-D (base) cuboid: (time, item, location, supplier)

• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
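
The lattice and the cell relationships above can be sketched in a few lines of Python. This is only an illustration of the definitions on the slide, not Kylin code; the helper names `cuboids` and `is_ancestor` are made up for the example, and `*` marks a dimension that has been aggregated away.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids(dims):
    """Return every cuboid (subset of dimensions), grouped by its size."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


def is_ancestor(a, b):
    """True if cell `a` generalizes cell `b` (and the two cells differ).

    `a` is an ancestor of `b` when every position of `a` is either '*'
    or equal to the value at the same position of `b`.
    """
    if a == b:
        return False
    return all(x == "*" or x == y for x, y in zip(a, b))


lattice = cuboids(DIMENSIONS)
total = sum(len(group) for group in lattice.values())

# The five example cells from the slide.
c1 = ("9/15", "milk", "Urbana", "Dairy_land")
c2 = ("9/15", "milk", "Urbana", "*")
c3 = ("*", "milk", "Urbana", "*")
c4 = ("*", "milk", "Chicago", "*")
c5 = ("*", "milk", "*", "*")

print(total)                 # number of cuboids for 4 dimensions
print(is_ancestor(c2, c1))   # cell 2 is the parent of base cell 1
print(is_ancestor(c4, c1))   # Chicago does not generalize Urbana
```

With four dimensions the lattice has 2^4 = 16 cuboids; cell 5 is an ancestor of cells 1 through 4, while cell 4 is unrelated to cell 1 because the locations differ.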
How
BI Tools, Web App…

ANSI SQL

Kylin

Map Reduce
Agenda
• About Apache Kylin
• Feature Highlights
• Tech Highlights
• Roadmap
• Q&A
Feature Highlights
• Extremely Fast OLAP Engine at scale
• ANSI SQL Interface on Hadoop
• Seamless Integration with BI Tools, like Tableau
• Interactive Query Capability
• MOLAP Cube
• Incremental Build of Cubes
• Approximate Query Capability for Distinct Count (HyperLogLog)
• Leverages HBase coprocessors to reduce query latency
• Job Management and Monitoring
• User-friendly web GUI to manage, build, monitor, and query cubes
• Security capability to set ACL at Cube/Project Level
• Support LDAP Integration
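
To give a feel for the approximate distinct count mentioned in the list above, here is a minimal HyperLogLog sketch in Python. It is a textbook-style illustration, not Kylin's implementation: the class name, the use of SHA-1 as the hash, and the precision `p=12` are all assumptions chosen for the example. The `merge` method shows why such sketches suit pre-aggregation: two sketches combine into one without revisiting the raw rows.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: register-wise maximum."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            est = self.m * math.log(self.m / zeros)
        return round(est)


first = HyperLogLog()
second = HyperLogLog()
for user_id in range(0, 60_000):
    first.add(user_id)
for user_id in range(40_000, 100_000):
    second.add(user_id)

first.merge(second)
print(first.count())  # an estimate close to 100000 distinct ids
```

With 4,096 registers the expected relative error is roughly 1.6%, so the estimate trades a small inaccuracy for a fixed, small memory footprint per cell.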
Define Data Model
Manage Jobs
Explore the Data
Interact with BI Tools - Tableau
Who is using Kylin?
• eBay
- 90% of queries take < 5 seconds

Case                     Cube Size   Raw Records
User Session Analysis    26 TB       28+ billion rows
Traffic Analysis         21 TB       20+ billion rows
Behavior Analysis        560 GB      1.2+ billion rows

• Baidu
- Baidu Map internal analysis

• Many other proofs of concept
- Huawei, Bloomberg Law, British GAS, JD.com, Microsoft, StubHub, Tableau … (from the mailing list)
Agenda
• About Apache Kylin
• Feature Highlights
• Tech Highlights
• Roadmap
• Q&A
Kylin Architecture Overview
• Clients/users interact with Kylin via SQL
• The OLAP cube is transparent to users
[Figure: architecture diagram]
- Clients: 3rd Party App (Web App, Mobile…) over REST API; SQL-Based Tool (BI Tools: Tableau…) over JDBC/ODBC
- Both send SQL to the REST Server, which passes it to the Query Engine and Routing
- Online analysis data flow: low latency (seconds) from the OLAP Cube (HBase); mid latency (minutes) from Hive on Hadoop
- Offline data flow: star schema data in Hive → Cube Build Engine (MapReduce…) → key-value data in the OLAP Cube (HBase)
- Metadata
Data Modeling
[Figure: Source → Mapping → Target]
- Source: Star Schema (a Fact table joined to Dim tables)
- Mapping: Cube Metadata (Cube: …, Fact Table: …, Dimensions: …, Measures: …, Storage(HBase): …)
- Target: HBase Storage (Row Key, Column Family, Column; e.g. row A → Val 1, row B → Val 2, row C → Val 3)
- Roles: End User, Cube Modeler, Admin
Cube Build Job Flow
How to Store Cube - HBase Schema
Kylin Query Engine - Explain Plan
SQL:

    SELECT test_cal_dt.week_beg_dt,
           test_category.category_name, test_category.lvl2_name,
           test_category.lvl3_name, test_kylin_fact.lstg_format_name,
           test_sites.site_name, SUM(test_kylin_fact.price) AS GMV,
           COUNT(*) AS TRANS_CNT
    FROM test_kylin_fact
    LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
    LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
         AND test_kylin_fact.lstg_site_id = test_category.site_id
    LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
    WHERE test_kylin_fact.seller_id = 123456
       OR test_kylin_fact.lstg_format_name = 'New'
    GROUP BY test_cal_dt.week_beg_dt,
             test_category.category_name, test_category.lvl2_name,
             test_category.lvl3_name,
             test_kylin_fact.lstg_format_name, test_sites.site_name

Explain plan:

    OLAPToEnumerableConverter
      OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
        OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
          OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
            OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
              OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
                OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
                  OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                    OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                    OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
                  OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
                OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
Cube Optimization
• Curse of Dimensionality
- An N-dimension cube has 2^N cuboids
- Full Cube vs. Partial Cube

• Huge Data Volume
- Dictionary Encoding
- Incremental Building
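
As a rough illustration of dictionary encoding, the sketch below maps each distinct dimension value to a small integer ID and concatenates the IDs into a compact key. It is an assumption-laden toy, not Kylin's dictionary format: the `Dictionary` class, the fixed byte width, and the sample rows are all invented for the example.

```python
class Dictionary:
    """Maps the sorted distinct values of one dimension to dense integer IDs."""

    def __init__(self, values):
        self.id_to_value = sorted(set(values))
        self.value_to_id = {v: i for i, v in enumerate(self.id_to_value)}
        # Bytes needed to store the largest ID (at least one byte).
        max_id = max(len(self.id_to_value) - 1, 0)
        self.width = max(1, (max_id.bit_length() + 7) // 8)

    def encode(self, value):
        return self.value_to_id[value].to_bytes(self.width, "big")

    def decode(self, raw):
        return self.id_to_value[int.from_bytes(raw, "big")]


# Hypothetical sample rows: (location, item).
rows = [("Urbana", "milk"), ("Chicago", "milk"), ("Urbana", "bread")]
location_dict = Dictionary(r[0] for r in rows)
item_dict = Dictionary(r[1] for r in rows)


def encode_row(row):
    """Concatenate the fixed-width IDs of each dimension into one key."""
    return location_dict.encode(row[0]) + item_dict.encode(row[1])


keys = [encode_row(r) for r in rows]
print(keys)
print(location_dict.decode(keys[0][:1]), item_dict.decode(keys[0][1:]))
```

Because IDs are assigned in sorted order, comparing encoded bytes preserves the order of the original values, which is convenient for range scans over a sorted key-value store.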
Full Cube vs. Partial Cube
• Full Cube
- Pre-aggregate all dimension combinations
- “Curse of dimensionality”: an N-dimension cube has 2^N cuboids.
• Partial Cube
- To avoid dimension explosion, we divide the dimensions into different aggregation groups
- 2^(N+M+L) → 2^N + 2^M + 2^L
- For a cube with 30 dimensions, if we divide the dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
- 2^30 → 2^10 + 2^10 + 2^10
- Tradeoff between online aggregation and offline pre-aggregation
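
The arithmetic behind the partial cube can be checked with a short Python sketch. The function names are hypothetical, and the sum follows the slide's formula as written (it counts each group's own apex cuboid separately). The last helper illustrates the tradeoff: a query whose dimensions span several groups has no single pre-computed cuboid and needs online aggregation instead.

```python
def full_cube_cuboids(n_dims):
    """Cuboids in a full cube: every subset of the dimensions."""
    return 2 ** n_dims


def partial_cube_cuboids(group_sizes):
    """Cuboids when dimensions are split into independent aggregation groups."""
    return sum(2 ** size for size in group_sizes)


def served_by_precomputed_cuboid(query_dims, groups):
    """True if all queried dimensions fall inside one aggregation group."""
    return any(set(query_dims) <= set(group) for group in groups)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])
print(full, partial)

# Hypothetical grouping of four dimensions into two groups.
groups = [("time", "item"), ("location", "supplier")]
print(served_by_precomputed_cuboid(("time", "item"), groups))
print(served_by_precomputed_cuboid(("time", "location"), groups))
```

Here 2^30 is 1,073,741,824 while 3 × 2^10 is 3,072, which matches the "1 billion to 3 thousand" reduction on the slide.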
Partial Cube
Incremental Build
What’s Next
• Improve cube algorithm
- Cube by segments, 30%-50% faster
- Build delay down to tens of minutes

• Streaming cubing
- Analyze real-time data
- Build delay down to seconds

• Spark
Cube by Layer
• The current algorithm
- Many MR jobs, as many as the number of dimensions
- Huge shuffles, aggregation on the reduce side, 100x of total cube size
[Figure: Full Data, then the 4-D, 3-D, 2-D, 1-D, and 0-D Cuboids, computed layer by layer with MR jobs]
Cube by Segments

• The to-be algorithm, 30%-50% faster
- 1 round of MR
- Reduced shuffles, map-side aggregation, 20x of total cube size
- Hourly incremental build done in tens of minutes
[Figure: Data Split → mapper → Cube Segment (one per split) → Merge Sort (Shuffle) → Final Cube]
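
The idea of cubing by segments can be sketched in plain Python, without any MapReduce machinery. This is a conceptual illustration under simplifying assumptions, not the actual algorithm: each "mapper" aggregates its own data split into all cuboids in memory, and a single merge step sums the pre-aggregated segments. The function names, the two dimensions, and the `price` measure are invented for the example.

```python
from collections import Counter
from itertools import combinations

DIMS = ("item", "location")


def build_segment(rows):
    """'Mapper': aggregate one data split into every cuboid, in memory."""
    segment = Counter()
    for row in rows:
        for k in range(len(DIMS) + 1):
            for dims in combinations(DIMS, k):
                key = (dims, tuple(row[d] for d in dims))
                segment[key] += row["price"]
    return segment


def merge_segments(segments):
    """'Reducer': sum the pre-aggregated cells of all segments."""
    cube = Counter()
    for segment in segments:
        cube.update(segment)  # Counter.update adds values for equal keys
    return cube


splits = [
    [
        {"item": "milk", "location": "Urbana", "price": 3},
        {"item": "milk", "location": "Chicago", "price": 4},
    ],
    [
        {"item": "milk", "location": "Urbana", "price": 5},
        {"item": "bread", "location": "Urbana", "price": 2},
    ],
]

cube = merge_segments(build_segment(split) for split in splits)
print(cube[((), ())])                                    # apex cell: total price
print(cube[(("item",), ("milk",))])                      # SUM(price) for milk
print(cube[(("item", "location"), ("milk", "Urbana"))])  # milk sold in Urbana
```

Because each split is aggregated before anything is exchanged, only already-combined cells reach the merge step, which is the intuition behind the reduced shuffle volume claimed on the slide.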
Streaming Cubing
• Cube is great but…
- Cube takes time to build, how about real-time analysis?
- Sometimes we want to drill down to row level information

• Streaming cubing
- Build micro cube segments from streaming
- Use inverted index to capture last minute data
Kylin Lambda Architecture
[Figure: lambda architecture]
- ANSI SQL → Query Engine → Hybrid Storage Interface
- Streaming → Inverted Index: Last Hour, seconds delay
- Cube: Before Last Hour, minutes delay
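
One way to picture the hybrid storage is as a time-based router: the part of a query's time range older than the last hour goes to the cube, and the most recent part goes to the inverted index. The Python sketch below is only a guess at that routing logic for illustration; the `route` function, its parameters, and the one-hour boundary as a fixed cutoff are assumptions, not Kylin's API.

```python
from datetime import datetime, timedelta


def route(query_start, query_end, now, cube_lag=timedelta(hours=1)):
    """Split the half-open range [query_start, query_end) between stores.

    Data older than `now - cube_lag` is assumed to be in the cube;
    newer data is assumed to be only in the inverted index.
    """
    boundary = now - cube_lag
    plan = []
    if query_start < boundary:
        plan.append(("cube", query_start, min(query_end, boundary)))
    if query_end > boundary:
        plan.append(("inverted_index", max(query_start, boundary), query_end))
    return plan


now = datetime(2015, 5, 6, 12, 0)
for store, start, end in route(datetime(2015, 5, 6, 9, 0), now, now):
    print(store, start.time(), end.time())
```

A query engine built this way would run both sub-queries and combine the partial aggregates, so the caller still sees a single SQL result.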

Adding Spark Support
• Cubing Efficiency
- MR is not the optimal framework
- Spark Cubing Engine
• Source from SparkSQL
- Read data from SparkSQL instead of Hive
• Route to SparkSQL
- Unsupported queries are covered by SparkSQL
Agenda
• About Apache Kylin
• Feature Highlights
• Tech Highlights
• Roadmap
• Q&A
Kylin Evolution Roadmap
[Figure: timeline, 2013 → 2014 → 2015 → 2016 → Future]
• Sep, 2013: Initial
- Prototype for MOLAP
- Basic end to end POC
• Jan, 2014: MOLAP
- Incremental Refresh
- ANSI SQL
- ODBC Driver
- Web GUI
- Tableau
- ACL
- Open Source
• Oct, 2014: HybridOLAP
- Streaming OLAP
- JDBC Driver
- New UI
- Excel
- … more
• H1, 2015: StreamingOLAP
- Lambda Arch
- Automation
- Capacity Management
- Spark
- SparkSQL
- … more
• TBD: Next Gen
- Adv OLAP Functions
- In-Memory Analysis (TBD)
- Mobile (TBD)
- … more
Kylin Ecosystem
• Kylin Core
- Fundamental framework of the Kylin OLAP Engine
• Extension
- Plugins to support additional functions and features
• Integration
- Lifecycle management support to integrate with other applications
• Interface
- Allows third-party users to build more features via user interfaces atop the Kylin core
[Figure: Kylin OLAP Core surrounded by three groups]
- Extension: Security, Redis Storage, Spark Engine, Docker
- Integration: ODBC Driver, ETL, Drill, SparkSQL
- Interface: Web Console, Customized BI, Ambari/Hue Plugin
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb

[email protected]
http://kylin.io
