《LESKOVEC, Jure et al. () Costeffective Outbreak Detection in Networks PPT.ppt》由会员分享,可在线阅读,更多相关《LESKOVEC, Jure et al. () Costeffective Outbreak Detection in Networks PPT.ppt(35页珍藏版)》请在三一办公上搜索。
1、Cost-effective Outbreak Detection in Networks,Jure Leskovec,Andreas Krause,Carlos Guestrin,Christos Faloutsos,Jeanne VanBriesen,Natalie Glance,Scenario 1:Water network,Given a real city water distribution networkAnd data on how contaminants spread in the networkProblem posed by US Environmental Prot
2、ection Agency,2,S,On which nodes should we place sensors to efficiently detect the all possible contaminations?,S,Scenario 2:Cascades in blogs,3,Blogs,Posts,Time ordered hyperlinks,Information cascade,Which blogs should one read to detect cascades as effectively as possible?,General problem,Given a
3、dynamic process spreading over the network We want to select a set of nodes to detect the process effectivelyMany other applications:EpidemicsInfluence propagationNetwork security,4,Two parts to the problem,Reward,e.g.:1)Minimize time to detection2)Maximize number of detected propagations3)Minimize
4、number of infected peopleCost(location dependent):Reading big blogs is more time consumingPlacing a sensor in a remote location is expensive,5,Problem setting,Given a graph G(V,E)and a budget B for sensorsand data on how contaminations spread over the network:for each contamination i we know the tim
5、e T(i,u)when it contaminated node uSelect a subset of nodes A that maximize the expected rewardsubject to cost(A)B,6,Reward for detecting contamination i,Overview,Problem definitionProperties of objective functionsSubmodularityOur solutionCELF algorithmNew boundExperimentsConclusion,7,Solving the pr
6、oblem,Solving the problem exactly is NP-hardOur observation:objective functions are submodular,i.e.diminishing returns,8,S1,S2,Placement A=S1,S2,S,New sensor:,Adding S helps a lot,S2,S4,S1,S3,Placement A=S1,S2,S3,S4,S,Adding S helps very little,Result 1:Objective functions are submodular,Objective f
7、unctions from Battle of Water Sensor Networks competition Ostfeld et al:1)Time to detection(DT)How long does it take to detect a contamination?2)Detection likelihood(DL)How many contaminations do we detect?3)Population affected(PA)How many people drank contaminated water?Our result:all are submodula
8、r,9,Background:Submodularity,Submodularity:For all placement s it holdsEven optimizing submodular functions is NP-hard Khuller et al,10,Benefit of adding a sensor to a small placement,Benefit of adding a sensor to a large placement,Background:Optimizing submodular functions,How well can we do?A gree
9、dy is near optimalat least 1-1/e(63%)of optimal Nemhauser et al 78But 1)this only works for unit cost case(each sensor/location costs the same)2)Greedy algorithm is slowscales as O(|V|B),11,a,b,c,a,b,c,d,d,reward,e,e,Greedy algorithm,Result 2:Variable cost:CELF algorithm,For variable sensor cost gre
10、edy can fail arbitrarily badlyWe develop a CELF(cost-effective lazy forward-selection)algorithma 2 pass greedy algorithmTheorem:CELF is near optimalCELF achieves(1-1/e)factor approximation CELF is much faster than standard greedy,12,Result 3:tighter bound,We develop a new algorithm-independent bound
11、 in practice much tighter than the standard(1-1/e)boundDetails in the paper,13,Scaling up CELF algorithm,Submodularity guarantees that marginal benefits decrease with the solution sizeIdea:exploit submodularity,doing lazy evaluations!(considered by Robertazzi et al for unit cost case),14,d,reward,Re
12、sult 4:Scaling up CELF,CELF algorithm:Keep an ordered list of marginal benefits bi from previous iterationRe-evaluate bi only for top sensorRe-sort and prune,15,a,b,c,a,b,c,d,d,reward,e,e,Result 4:Scaling up CELF,CELF algorithm:Keep an ordered list of marginal benefits bi from previous iterationRe-e
13、valuate bi only for top sensorRe-sort and prune,16,a,a,b,c,d,reward,e,Result 4:Scaling up CELF,CELF algorithm:Keep an ordered list of marginal benefits bi from previous iterationRe-evaluate bi only for top sensorRe-sort and prune,17,a,c,a,b,c,d,d,reward,e,Overview,Problem definitionProperties of obj
14、ective functionsSubmodularityOur solutionCELF algorithmNew boundExperimentsConclusion,18,Experiments:Questions,Q1:How close to optimal is CELF?Q2:How tight is our bound?Q3:Unit vs.variable costQ4:CELF vs.heuristic selectionQ5:Scalability,19,Experiments:2 case studies,We have real propagation dataBlo
15、g network:We crawled blogs for 1 yearWe identified cascades temporal propagation of informationWater distribution network:Real city water distribution networksRealistic simulator of water consumption provided by US Environmental Protection Agency,20,Case study 1:Cascades in blogs,We crawled 45,000 b
16、logs for 1 yearWe obtained 10 million postsAnd identified 350,000 cascades,21,Q1:Blogs:Solution quality,Our bound is much tighter13%instead of 37%,22,Old bound,Our bound,CELF,Q2:Blogs:Cost of a blog,Unit cost:algorithm picks large popular blogs:,Variable cost:proportional to the number of postsWe ca
17、n do much better when considering costs,23,Unit cost,Variable cost,Q4:Blogs:Heuristics,CELF wins consistently,24,Q5:Blogs:Scalability,CELF runs 700 times faster than simple greedy algorithm,25,Case study 2:Water network,Real metropolitan area water network(largest network optimized):V=21,000 nodesE=
18、25,000 pipes3.6 million epidemic scenarios(152 GB of epidemic data)By exploiting sparsity we fit it into main memory(16GB),26,Q1:Water:Solution quality,Again our bound is much tighter,27,Old bound,Our bound,CELF,Q3:Water:Heuristic placement,Again,CELF consistently wins,28,Water:Placement visualizati
19、on,Different objective functions give different sensor placements,29,Population affected,Detection likelihood,Q5:Water:Scalability,CELF is 10 times faster than greedy,30,Results of BWSN competition,31,Battle of Water Sensor Networks competitionOstfeld et al:count number of non-dominated solutions,Co
20、nclusion,General methodology for selecting nodes to detect outbreaksResults:Submodularity observationVariable-cost algorithm with optimality guaranteeTighter boundSignificant speed-up(700 times)Evaluation on large real datasets(150GB)CELF won consistently,32,Other results see our poster,Many more de
21、tails:Fractional selection of the blogsGeneralization to future unseen cascadesMulti-criterion optimization We show that triggering model of Kempe et al is a special case of out setting,33,Thank you!Questions?,Blogs:generalization,34,Blogs:Cost of a blog(2),But then algorithm picks lots of small blogs that participate in few cascadesWe pick best solution that interpolates between the costsWe can get good solutions with few blogs and few posts,35,Each curve represents solutions with the same score,