Efficient Management of Inconsistent and Uncertain Data
Renée J. Miller
University of Toronto

Contributors
- Ariel Fuxman, PhD Thesis
  - Microsoft Search Labs
  - Jim Gray SIGMOD 2008 Dissertation Award
- Periklis Andritsos, PhD
- Jiang Du, MS
- Elham Fazli, MS
- Diego Fuxman, Undergrad

Dirty Databases
- The presence of dirty data is a major problem in enterprises
- Traditional solution: data cleaning
- [Cartoon caption: "No. I don't see any problem with the data."]

Limitations of Data Cleaning
- Semi-automatic process
- Requires highly qualified domain experts
- Time-consuming
- May not be possible to wait until the database is clean
- Operational systems answer queries assuming clean data

Our Work
- Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases
- Show how to do so efficiently while reusing existing database technology

Why is this Business Intelligence?
- Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of information.
- The goal of BI is to support better decision making, based on information.
- A DBMS should provide meaningful query answers even over data that is dirty

Outline
- Introduction
- Semantics for dirty databases
- Contributions
- Conclusions

A Data Integration Example
- Integrating customer data
- [Diagram: Sales, Shipping, Customer Support, Web Forms, and Demographic Data feed into an Integrated Customer Database]

Matching and Merging
- [Tables: Web, Sales]
- Matching and merging are two fundamental tasks in data integration

True Disagreement Between Sources
- [Tables: Web, Sales]
- What's Peter's salary?

Inconsistent Integrated Databases
- In the absence of complete resolution rules
- [Diagram labels: Web and Sales (SATISFY custid KEY); Inconsistent Integrated Database (VIOLATES custid KEY)]

Querying Inconsistent Databases
- Example: Offering a Platinum credit card
- Query: “Get customers who make more than 100K”
- [Table: tuples labeled by source (sales, web, sales/web, sales, web) for customers Peter, Paul, Mary]
- Are we sure that we want to offer a card to Peter?

Querying Inconsistent Databases
- Aggressive: Get customers who possibly make more than 100K
  - Peter, Paul, Mary
- Conservative: Get customers who certainly make more than 100K
  - Paul, Mary

Formal Semantics
- Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98]
  - Possible world: “complete” databases
- Consistent answers
  - Proposed by Arenas, Bertossi, and Chomicki in 1999
  - Corresponds to conservative semantics
  - Possible world: “consistent” databases

Consistent Answers
- Key: custid
- [Diagram: the inconsistent database (tuples labeled sales, web, sales/web, sales, web) and its repairs]

Consistent Answers
- Query q = “Get customers who make more than 100K”
- Consistent answers: answers obtained no matter which repair we choose
- [Diagram: q evaluated on each of the four repairs]
- Consistent answer = {Paul, Mary}

When We Started
- Semantics well understood
- Problem: potentially HUGE number of repairs!
- Negative results [Chomicki et al. 02, Arenas et al. 01, Cali et al. 04]
- Few tractability results [Arenas et al. 99, Arenas et al. 01]
- Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03]
  - Expressive queries and constraints
  - Computationally expensive
  - Applicable only to small databases with a small number of inconsistencies

Our Proposal: ConQuer
- [Diagram: ConQuer's Rewriting Algorithm takes a SQL query q and the keys and produces a rewritten SQL query Q*; a commercial database engine evaluates Q* over the inconsistent database and returns the consistent answer to q]

Class of Rewritable Queries
- ConQuer handles a broad class of SPJ queries with
  - Set semantics
  - Bag semantics, grouping, and aggregation
- No restrictions on
  - Number of relations
  - Number of joins
  - Conditions or built-in predicates
  - Key-to-key joins
- The class is “maximal”

Why not all SPJ queries?
- Some SPJ queries cannot be rewritten into SQL
  - Consistent query answering is coNP-complete even for some SPJ queries and key constraints
- Maximality of ConQuer's class
  - Minimal relaxations lead to intractability
- Restrictions only on
  - Nonkey-to-nonkey joins
  - Self joins
  - Nonkey-to-key joins that form a cycle

Example: A Rewritable Query (TPC-H Query 10)
  SELECT c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue,
         c_acctbal, n_name, c_address, c_phone, c_comment
  FROM customer, orders, lineitem, nation
  WHERE c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= '1993-10-01'
    and o_orderdate < date('1993-10-01') + 3 MONTHS
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
  GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
  ORDER BY revenue desc

Rewritings Can Get Quite Complex
- [Figure: rewriting of TPC-H Query 10]
- Can this rewriting be executed efficiently?
- 1.7x overhead (20 GB database, 5% inconsistency)

Experimental Evaluation
- Goals
  - Quantify the overhead of the rewritings
  - Assess the scalability of the approach
  - Determine the sensitivity of the rewritten queries to the level of inconsistency of the instance
- Queries and databases
  - Representative decision support queries (TPC-H benchmark)
  - TPC-H databases, altered to introduce inconsistencies
- Database parameters
  - Database size
  - Percentage of the database that is inconsistent
  - Conflicts per key value (in the inconsistent portion)

Scalability
- [Plot: overhead against Size (GB); 5% inconsistent tuples, 2 conflicts per inconsistent key value]
- Worst case: 5.8x overhead, selectivity 98.56%
- Best case: 1.2x overhead, selectivity 0.001%

Contributions: Theory
- Formal characterization of a broad class of queries
  - For which computing consistent answers is tractable under key constraints
  - That can be rewritten into first-order/SQL
- Query rewriting algorithms for a class of Select-Project-Join queries
  - With set semantics
  - With bag semantics, grouping, and aggregation
- Maximality of the class of queries

Contributions: Practice
- Implementation of ConQuer
  - Designed to compute consistent answers efficiently
  - Multiple rewriting strategies
- Experimental validation of efficiency and scalability
  - Representative queries from TPC-H
  - Large databases

Uncertain Data
- [Diagram: Web and Sales sources merged into an Integrated Database; tuples carry the values 0.3, 0.7, and 1, derived from PROVENANCE INFORMATION (e.g., source reputation)]

Publications and Demo
- These and other contributions appear in
  - ICDT05/JCSS06
  - SIGMOD05
  - ICDE06
  - PODS06/TODS06
  - VLDB06
- Demo given at VLDB05
  - queens.db.toronto.edu/project/conquer/demo2/

A Virtuous Cycle
- [Diagram: cycle between Data Integration and Query Answering]
- Recognize and characterize inconsistent data
- Use knowledge about inconsistencies to:
  - give better answers
  - suggest ways to clean the database

Beyond the Enterprise
- Can we apply principled models of inconsistency or uncertainty to the Web?
- Different assumptions
  - Uncertainty in queries
  - There's never a “true” answer
- Challenge
  - Build models based on user preferences
  - Leverage massive repositories of user behavior data

THANK YOU
- Plug: Discovering Data Quality Rules, Fei Chiang, Thursday 11:15am, Research Session 33
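
Backup: Consistent Answers in Code

The consistent-answer semantics and the idea of rewriting q into a SQL query Q* can be illustrated with a minimal Python sketch. This is not ConQuer's actual rewriting algorithm. The table name customer, the salary values, and all helper names are assumptions made for illustration; the slides give only the customer names and the expected answers. The sketch enumerates every repair under the key custid, intersects the query results across repairs, and compares that with a simple NOT EXISTS rewriting that SQLite evaluates directly on the inconsistent table.

```python
import sqlite3
from itertools import groupby, product

# Hypothetical data: (custid, name, salary, source). Salaries are invented;
# Peter and Mary each have two tuples that violate the key custid.
ROWS = [
    (1, "Peter", 120_000, "sales"),
    (1, "Peter", 80_000, "web"),
    (2, "Paul", 150_000, "sales/web"),
    (3, "Mary", 110_000, "sales"),
    (3, "Mary", 130_000, "web"),
]


def repairs(rows):
    """Yield every repair: keep exactly one tuple per custid value."""
    ordered = sorted(rows, key=lambda r: r[0])
    groups = [list(g) for _, g in groupby(ordered, key=lambda r: r[0])]
    for choice in product(*groups):
        yield list(choice)


def q(rows):
    """Query q: names of customers who make more than 100K."""
    return {name for _, name, salary, _ in rows if salary > 100_000}


def consistent_answers(rows):
    """Answers returned by q on every repair (conservative semantics)."""
    return set.intersection(*(q(r) for r in repairs(rows)))


def possible_answers(rows):
    """Answers returned by q on at least one repair (aggressive semantics)."""
    return set.union(*(q(r) for r in repairs(rows)))


# A first-order rewriting of q for this simple selection query: keep a name
# only if no tuple with the same key fails the salary condition or
# disagrees on the name.
Q_STAR = """
SELECT DISTINCT c.name
FROM customer AS c
WHERE c.salary > 100000
  AND NOT EXISTS (
        SELECT 1 FROM customer AS d
        WHERE d.custid = c.custid
          AND (d.salary <= 100000 OR d.name <> c.name))
"""


def rewritten_answers(rows):
    """Run Q* directly on the inconsistent table, without building repairs."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE customer (custid INT, name TEXT, salary INT, source TEXT)"
    )
    con.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)
    result = {name for (name,) in con.execute(Q_STAR)}
    con.close()
    return result


if __name__ == "__main__":
    print(sorted(possible_answers(ROWS)))    # aggressive semantics
    print(sorted(consistent_answers(ROWS)))  # conservative semantics
    print(sorted(rewritten_answers(ROWS)))   # computed in SQL on dirty data
```

With these invented salaries the sketch reproduces the answers from the Platinum credit card example: Peter, Paul, and Mary are possible answers, while only Paul and Mary are consistent answers, and the rewritten query returns the same set as the repair enumeration. The enumeration visits four repairs here, but the number of repairs grows exponentially with the number of conflicting keys, which is why evaluating a rewritten query on the inconsistent database is the attractive route.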