Apache Kylin in Big Data Systems (courseware, .ppt, 36 slides)
Apache Kylin
OLAP on Hadoop
http://kylin.io

Agenda
  What's Apache Kylin?
  Tech Highlights
  Performance
  Roadmap
  Q&A

What's Kylin
  Extreme OLAP Engine for Big Data
  Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
  kylin /kiln/ 麒麟: n. (in Chinese art) a mythical animal of composite form
  Open sourced on Oct 1st, 2014
  Accepted as an Apache Incubator project on Nov 25th, 2014

Big Data Era
  More and more data becoming available on Hadoop
  Limitations in existing Business Intelligence (BI) tools
    Limited support for Hadoop
    Data size growing exponentially
    High latency of interactive queries
    Scale-up architecture
  Challenges to adopt Hadoop as an interactive analysis system
    Majority of analyst groups are SQL savvy
    No mature SQL interface on Hadoop
    OLAP capability on Hadoop ecosystem not ready yet
  Why not build an engine from scratch?

Features Highlights
  Extreme Scale OLAP Engine: Kylin is designed to query 10+ billion rows on Hadoop
  ANSI SQL Interface on Hadoop: Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
  Seamless Integration with BI Tools: Kylin currently offers integration capability with BI tools like Tableau
  Interactive Query Capability: Users can interact with Hive tables at sub-second latency
  MOLAP Cube: Define a data model from Hive tables and pre-build it in Kylin
  Scale Out Architecture: Query server cluster supports thousands of concurrent users and provides high availability

Features Highlights
  Compression and Encoding Support
  Incremental Refresh of Cubes
  Approximate Query Capability for distinct count (HyperLogLog)
  Leverage HBase Coprocessor for query latency
  Job Management and Monitoring
  Easy Web interface to manage, build, monitor and query cubes
  Security capability to set ACL at Cube/Project Level
  Support LDAP Integration

[Figure slides: Cube Designer; Job Management; Query and Visualization; Tableau Integration]

Who Is Using Kylin
  eBay: 90% of queries < 5 seconds
  Baidu: Baidu Map internal analysis
  Many other proofs of concept: Bloomberg Law, British Gas, JD, Microsoft, StubHub, Tableau

Kylin Architecture Overview
  Clients
    SQL-based tools (BI tools: Tableau) via JDBC/ODBC, SQL
    3rd party apps (web app, mobile) via REST API, SQL
  Kylin components
    REST Server
    Query Engine
    Routing
    Metadata
    Cube Build Engine (MapReduce)
  Data
    Hadoop, Hive: star schema data (mid latency, minutes)
    OLAP cube, Data Cube (HBase): key-value data (low latency, seconds)
  Data flows: online analysis data flow; offline data flow
  Clients/users interact with Kylin via SQL
  OLAP cube is transparent to users

Data Modeling
  Source: star schema (fact table with dimension tables)
  Mapping: cube metadata
    Cube: fact table, dimensions, measures, storage (HBase)
  Target: HBase storage
    Row key: row A, row B, row C
    Column family, column: Val 1, Val 2, Val 3
  Roles: End User, Cube Modeler, Admin

OLAP Cube: Balance between Space and Time
  Cuboid = one combination of dimensions
  Cube = all combinations of dimensions (all cuboids)
  [Figure: cuboid lattice over the dimensions time, item, location, supplier, from the 0-D (apex) cuboid through the 1-D, 2-D and 3-D cuboids to the 4-D (base) cuboid]
  Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
    1. (9/15, milk, Urbana, Dairy_land)
    2. (9/15, milk, Urbana, *)
    3. (*, milk, Urbana, *)
    4. (*, milk, Chicago, *)
    5. (*, milk, *, *)

[Figure slides: Cube Build Job Flow; How to Store Cube? HBase Schema]

How to Query Cube? Query Engine: Calcite
  Dynamic data management framework.
  Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others.
  http://optiq.incubator.apache.org

How to Query Cube? Kylin Extensions on Calcite
  Metadata SPI
    Provide table schema from Kylin metadata
  Optimize Rule
    Translate the logical operator into a Kylin operator
  Relational Operator
    Find the right cube
    Translate SQL into storage engine API calls
    Generate the physical execution plan by linq4j Java implementation
  Result Enumerator
    Translate the storage engine result into the Java implementation result
  SQL Function
    Add HyperLogLog for distinct count
    Implement date/time related functions (e.g. Quarter)

Query Engine: Kylin Explain Plan

    SELECT test_cal_dt.week_beg_dt,
           test_category.category_name,
           test_category.lvl2_name,
           test_category.lvl3_name,
           test_kylin_fact.lstg_format_name,
           test_sites.site_name,
           SUM(test_kylin_fact.price) AS GMV,
           COUNT(*) AS TRANS_CNT
    FROM test_kylin_fact
    LEFT JOIN test_cal_dt
      ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
    LEFT JOIN test_category
      ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
     AND test_kylin_fact.lstg_site_id = test_category.site_id
    LEFT JOIN test_sites
      ON test_kylin_fact.lstg_site_id = test_sites.site_id
    WHERE test_kylin_fact.seller_id = 123456
       OR test_kylin_fact.lstg_format_name = 'New'
    GROUP BY test_cal_dt.week_beg_dt,
             test_category.category_name,
             test_category.lvl2_name,
             test_category.lvl3_name,
             test_kylin_fact.lstg_format_name,
             test_sites.site_name

    OLAPToEnumerableConverter
      OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
        OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
          OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
            OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
              OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
                OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
                  OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                    OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                    OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
                  OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
                OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

How to Query Cube? Storage Engine
  Plugin-able storage engine
    Common iterator interface for storage engines
    Isolate the query engine from the underlying storage
  Translate the cube query into an HBase table scan
    Columns, Groups -> Cuboid ID
    Filters -> Scan Range (Row Key)
    Aggregations -> Measure Columns (Row Values)
  Scan the HBase table and translate the HBase result into the cube result
    HBase Result (key + value) -> Cube Result (dimensions + measures)

How to Optimize Cube? Cube Optimization
  Curse of dimensionality: an N-dimension cube has 2^N cuboids
    Full Cube vs. Partial Cube
  Huge data volume
    Dictionary Encoding
    Incremental Building

How to Optimize Cube? Full Cube vs. Partial Cube
  Full Cube
    Pre-aggregate all dimension combinations
    "Curse of dimensionality": an N-dimension cube has 2^N cuboids
  Partial Cube
    To avoid dimension explosion, we divide the dimensions into different aggregation groups
      2^(N+M+L) -> 2^N + 2^M + 2^L
    For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the cuboid number will reduce from 1 billion to 3 thousand
      2^30 -> 2^10 + 2^10 + 2^10
  Tradeoff between online aggregation and offline pre-aggregation

[Figure slide: How to Optimize Cube? Partial Cube]
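
The cuboid arithmetic on the Full Cube vs. Partial Cube slide can be checked with a short sketch. It follows the slide's simplified formula only (2^N for a full cube, a sum of 2^size over the aggregation groups for a partial cube); the function names are hypothetical, and the counting inside Kylin itself may differ in detail.

```python
from typing import Iterable


def full_cube_cuboids(num_dimensions: int) -> int:
    """Cuboids in a full cube: every subset of the dimensions, i.e. 2^N."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes: Iterable[int]) -> int:
    """Cuboids in a partial cube, using the slide's simplified formula:
    each aggregation group is pre-aggregated on its own, so the total is
    the sum of 2^size over the groups."""
    return sum(2 ** size for size in group_sizes)


if __name__ == "__main__":
    # The slide's example: 30 dimensions split into 3 groups of 10.
    print(full_cube_cuboids(30))               # 1073741824 (about 1 billion)
    print(partial_cube_cuboids([10, 10, 10]))  # 3072 (about 3 thousand)
```

The price of the smaller cube is the tradeoff the slide names: a query that combines dimensions from different groups has no pre-built cuboid and must be aggregated online.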
How to Optimize Cube? Dictionary Encoding
  A data cube has lots of duplicated dimension values
  The dictionary maps dimension values to IDs, which reduces the memory and storage footprint
  The dictionary is based on a trie

[Figure slide: How to Optimize Cube? Incremental Build]

Streaming (ongoing effort)
  Cube is great, but:
    Sometimes we want to drill down to row level information
    Cube takes time to build; how about real-time analysis?
  Streaming with inverted index

Kylin 0.8: Lambda Architecture
  [Diagram labels: streaming; Kafka; hourly/daily batch; minutes batch; Inverted Index (Real-time Store); Cube (Historic Store); SQL Query; Hybrid Storage Interface]

Kylin vs. Hive
  #  Query Type              Return Dataset  Query on Kylin (s)  Query on Hive (s)  Comments
  1  High Level Aggregation               4               0.129            157.437  1,217 times
  2  Analysis Query                  22,669               1.615            109.206  68 times
  3  Drill Down to Detail           325,029              12.058            113.123  9 times
  4  Drill Down to Detail           524,780               22.42            6383.21  278 times
  5  Data Dump                      972,002              49.054                N/A
  [Chart: Hive vs. Kylin for SQL #1 (High Level Aggregation), SQL #2 (Analysis Query) and SQL #3 (Drill Down to Detail); other labels: Low Level Aggregation, Transaction Level]
  Based on a 12+B records case

Performance: Concurrency
  Linear scale out with more nodes

Performance: Query Latency
  90% of queries < 5s
  Green line: 90th percentile queries
  Gray line: 95th percentile queries

Kylin Evolution Roadmap
  Years: 2013, 2014, 2015
  Milestones: Sep 2013; Jan 2014; Sep 2014; H1 2015
  Initial
  Prototype for MOLAP
  Basic end-to-end POC
  MOLAP: Incremental Refresh; ANSI SQL; ODBC Driver; Web GUI; ACL; Open Source
  HOLAP
  Next Gen: Lambda Arch; Automation
  Streaming OLAP; JDBC Driver; New GUI; Excel Support; more
  Capacity Management; In-Memory Analysis (TBD); Spark (TBD); more
  Future: TBD

Kylin Ecosystem
  Kylin Core: fundamental framework of the Kylin OLAP engine
  Extension: plugins to support additional functions and features
  Integration: lifecycle management support to integrate with other applications
  Interface: allows third-party users to build more features via a user interface atop the Kylin core
  Driver: ODBC and JDBC drivers

  Kylin OLAP Core
  Extension: Security; Redis Storage; Spark Engine; Docker
  Interface: Web Console; Customized BI; Ambari/Hue Plugin
  Integration: ODBC Driver; ETL; Drill; SparkSQL
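
The deck says third-party apps reach Kylin with SQL over a REST API. As a minimal sketch of what such a client could look like, the code below only builds the HTTP request and does not send it. The endpoint path `/kylin/api/query`, the JSON fields, HTTP basic authentication and the port 7070 are assumptions taken from Kylin's public REST documentation, not from these slides; the function name, host, project and credentials are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def build_query_request(base_url: str, project: str, sql: str,
                        user: str, password: str,
                        limit: int = 100) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits SQL to Kylin.

    Assumed, not confirmed by the slides: the /kylin/api/query path,
    the payload field names, and basic authentication.
    """
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host, project and credentials, for illustration only.
    request = build_query_request(
        "http://localhost:7070", "demo_project",
        "SELECT COUNT(*) FROM test_kylin_fact", "user", "secret")
    print(request.get_method(), request.full_url)
    # To execute against a running server:
    #   with urllib.request.urlopen(request) as response:
    #       print(json.load(response))
```

Keeping request construction separate from sending makes the sketch checkable without a running Kylin instance.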