Intelligent Information Retrieval
Prof. Xiaoyong Du (杜小勇), Renmin University of China; Prof. Ji-Rong Wen (文继荣), Microsoft Research Asia.

Course Introduction

Course history
On September 27, 2007, we held a joint academic workshop with Microsoft Research Asia on "Internet Data Management", at which Dr. Ji-Rong Wen was appointed an adjunct researcher. In the spring semester of 2008 the Intelligent Information Retrieval course was offered for the first time in cooperation with Microsoft; nine researchers from MSRA gave eleven lectures. Reference: 《从智能信息检索看微软前瞻性课程》 (Viewing Microsoft's forward-looking courses through Intelligent Information Retrieval), Computer Education (计算机教育).

Teaching style
IR fundamentals plus topical lectures. The topical lectures are given by Microsoft researchers and are very information-dense.
Assessment:
(1) Choose one topic and write a survey-style report covering: What is the research problem? What is the theoretical foundation of the field? Where are the technical difficulties? What approaches and methods currently exist for solving the problem? What experimental methods and evaluation frameworks are used to study these problems? Present your own views.
(2) Hand the printed report to the teaching assistant in the last class; it will be distributed to the relevant lecturers for review.
(3) Regular assessment, mainly based on participation in discussions.

Course content
Fundamentals: basic models, basic implementation techniques, evaluation methods.
Core techniques and systems: Ranking, Information Extraction, Log Mining, System Implementation.
Applications: Image search, Multi-language search, Knowledge Management.

Reading Materials
R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.
TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005.
K. S. Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997.
Proceedings of SIGIR, SIGMOD, WWW.

Course schedule: http:/
Contact: Xiaoyong Du (Information Building, Room 0459); Ji-Rong Wen.

Introduction to IR
Prof. Xiaoyong Du
What is Information Retrieval?
Definition by comparison with related fields.

What is IR?
Definition by examples: library systems such as CALIS; search engines such as Google and Baidu.

What is IR?
Definition by content: IR = <D, Q, R>, where
D: the document collection;
Q: the user's queries;
R: the relevance/similarity degree between a query qi and a document dj.

IR System Architecture
Classical IR: unstructured data, indexer, index, ranker, user interface.
Web IR adds: crawler, the Web, query log and feedback, extractor, data miner.

Content
IR models; system architecture; evaluation and benchmarks; key techniques: media-related operators, indexing, information extraction (IE), classification and clustering, link analysis, relevance evaluation.

Related Areas
Natural Language Processing, large-scale distributed computing, databases, data mining, information science, artificial intelligence.

IR Model
Representation: how to represent a document or query, e.g. bag of words, sequence of words, links between documents, semantic networks.
Similarity/relevance evaluation: sim(dj, q) = ?
Two broad classes of models
Content-based retrieval models: the Boolean model, the vector space model, the probabilistic model, the statistical language model.
Other retrieval models not based on content: collaboration-based models, link-analysis-based models, association-based models.

Classical IR Models - Basic Concepts
Bag-of-words model: each document is represented by a set of representative keywords, or index terms. The importance of an index term is represented by a weight associated with it. A small worked example follows the definitions below.
Let
ki: an index term;
dj: a document;
t: the total number of index terms in the system;
K = {k1, k2, ..., kt}: the set of all index terms;
wij >= 0: a weight associated with the pair (ki, dj); wij quantifies the importance of the index term for describing the document contents, and wij = 0 indicates that the term does not occur in the document;
vec(dj) = (w1j, w2j, ..., wtj): the weighted vector associated with document dj;
gi(vec(dj)) = wij: a function that returns the weight of term ki in document dj.
Classical IR Models - Basic Concepts
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query. A ranking is based on fundamental premises regarding the notion of relevance, such as: common sets of index terms, sharing of weighted terms, likelihood of relevance. Each set of premises leads to a distinct IR model.

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM).

The Boolean Model
A simple model based on set theory. Queries are specified as Boolean expressions, with precise semantics and a neat formalism, e.g. q = ka AND (kb OR NOT kc).
Terms are either present or absent, so wij is in {0, 1}.
Consider q = ka AND (kb OR NOT kc). In disjunctive normal form, vec(qdnf) = (1,1,1) OR (1,1,0) OR (1,0,0); each component such as vec(qcc) = (1,1,0) is a conjunctive component.

The Boolean Model
For q = ka AND (kb OR NOT kc):
sim(q, dj) = 1 if there exists a conjunctive component vec(qcc) in vec(qdnf) such that, for every index term ki, gi(vec(dj)) = gi(vec(qcc)); otherwise sim(q, dj) = 0.
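A minimal sketch (not from the slides) of this matching rule, assuming binary document vectors over the three query terms:

```python
# Hypothetical illustration of the Boolean model: sim(q, d) is 1 iff some
# conjunctive component of the query's DNF agrees with the document on every term.
def sim_boolean(doc_vec: tuple[int, ...], qdnf: list[tuple[int, ...]]) -> int:
    return int(any(all(d == c for d, c in zip(doc_vec, qcc)) for qcc in qdnf))

# q = ka AND (kb OR NOT kc) over terms (ka, kb, kc):
qdnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
print(sim_boolean((1, 1, 0), qdnf))  # 1 -> document matches component (1,1,0)
print(sim_boolean((0, 1, 1), qdnf))  # 0 -> no conjunctive component matches
```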
Drawbacks of the Boolean Model
Exact matching, no ranking.
Awkward: the information need has to be translated into a Boolean expression.
Too simple: the Boolean queries formulated by users are most often too simplistic.
Unsatisfactory results: the Boolean model frequently returns either too few or too many documents in response to a user query.

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM).

The Vector Model
Non-binary weights allow partial matches. The term weights are used to compute a degree of similarity between a query and each document, and the ranked set of documents provides better matching.

The Vector Model
Define:
wij > 0 whenever ki occurs in dj;
wiq >= 0 is the weight associated with the pair (ki, q);
vec(dj) = (w1j, w2j, ..., wtj) and vec(q) = (w1q, w2q, ..., wtq).
Index terms are assumed to occur independently within documents; that means the vector space is orthonormal: the t terms form an orthonormal basis for a t-dimensional space, and queries and documents are represented as weighted vectors in this space.

The Vector Model
sim(q, dj) = cos(theta) = [vec(dj) . vec(q)] / (|dj| * |q|) = [sum_i wij * wiq] / (|dj| * |q|).
Since wij >= 0 and wiq >= 0, we have 0 <= sim(q, dj) <= 1.
(Figure: the angle theta between vec(dj) and vec(q).)
A document is retrieved even if it matches the query terms only partially.

The Vector Model
sim(q, dj) = [sum_i wij * wiq] / (|dj| * |q|). The KEY question is how to compute the weights wij and wiq. A good weight must take two effects into account:
quantification of intra-document content (similarity): the tf factor, the term frequency within a document;
quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency.
TF*IDF formula: wij = tf(i, j) * idf(i).

The Vector Model
Let N be the total number of docs in the collection, ni the number of docs that contain ki, and freq(i, j) the raw frequency of ki within dj.
A normalized tf factor is given by tf(i, j) = freq(i, j) / max_l freq(l, j), where kl ranges over the terms of dj.
The idf factor is computed as idf(i) = log(N / ni); the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.

The Vector Model
tf-idf weighting scheme: wij = tf(i, j) * log(N / ni). This is among the best known term-weighting schemes.
For the query term weights, wiq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / ni), or the weights are specified by the user.
The vector model with tf-idf weights is a good ranking strategy for general collections; it is usually as good as the known ranking alternatives, and it is simple and fast to compute. A small scoring sketch follows.
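A minimal sketch (not from the slides) of tf-idf weighting and cosine ranking, assuming a tiny in-memory corpus and using the same tf-idf weighting for the query as for documents:

```python
# Hypothetical illustration of the vector model: tf-idf weights + cosine similarity.
import math
from collections import Counter

docs = ["information retrieval models", "language models for retrieval", "image search"]
vocab = sorted({w for d in docs for w in d.split()})
N = len(docs)
n_i = {t: sum(t in d.split() for d in docs) for t in vocab}   # docs containing term t

def weight_vector(text: str) -> list[float]:
    freq = Counter(text.split())
    max_f = max(freq.values()) if freq else 1
    # w_ij = tf(i,j) * idf(i) = (freq / max_freq) * log(N / n_i)
    return [(freq[t] / max_f) * math.log(N / n_i[t]) for t in vocab]

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = weight_vector("retrieval models")
for d in docs:
    print(d, "->", round(cosine(weight_vector(d), q), 3))
```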
The Vector Model
Advantages: term weighting improves the quality of the answer set; partial matching allows retrieval of documents that approximate the query conditions; the cosine ranking formula sorts documents according to their degree of similarity to the query.
Disadvantages: it assumes independence of index terms.

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM).

Probabilistic Model
Objective: to capture the IR problem in a probabilistic framework. Given a user query, there is an ideal answer set; querying is the specification of the properties of this ideal answer set (as in clustering).
But what are these properties? Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set) and improve it by iteration.

Probabilistic Model
Basic ideas: an initial set of documents is retrieved somehow; the user inspects these documents looking for the relevant ones (in practice, only the top 10-20 need to be inspected); the IR system uses this information to refine the description of the ideal answer set. By repeating this process, the description of the ideal answer set is expected to improve. The description of the ideal answer set is modeled in probabilistic terms.

Probabilistic Ranking Principle
The probabilistic model tries to estimate the probability that the user will find document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations. Let R be the ideal answer set.
But how do we compute probabilities? What is the sample space?

The Ranking
Probabilistic ranking is computed as sim(q, dj) = P(dj relevant to q) / P(dj non-relevant to q).
Definitions: wij is in {0, 1}; P(R|vec(dj)) is the probability that the given document is relevant, and P(¬R|vec(dj)) the probability that it is not relevant.

The Ranking
sim(dj, q) = P(R|vec(dj)) / P(¬R|vec(dj)) = [P(vec(dj)|R) * P(R)] / [P(vec(dj)|¬R) * P(¬R)] ~ P(vec(dj)|R) / P(vec(dj)|¬R),
where P(vec(dj)|R) is the probability of randomly selecting document dj from the set R of relevant documents.

The Ranking
Assuming term independence,
sim(dj, q) ~ [prod over ki present in dj: P(ki|R)] * [prod over ki absent: P(¬ki|R)] / ([prod over ki present: P(ki|¬R)] * [prod over ki absent: P(¬ki|¬R)]),
where P(ki|R) is the probability that index term ki is present in a document randomly selected from the set R of relevant documents. Taking logarithms and dropping factors that are constant for all documents,
sim(dj, q) ~ sum_i wiq * wij * ( log[ P(ki|R) / (1 - P(ki|R)) ] + log[ (1 - P(ki|¬R)) / P(ki|¬R) ] ),
where P(¬ki|R) = 1 - P(ki|R) and P(¬ki|¬R) = 1 - P(ki|¬R).

The Initial Ranking
How do we compute the probabilities P(ki|R) and P(ki|¬R)? Estimates based on assumptions:
P(ki|R) = 0.5 and P(ki|¬R) = ni / N, where ni is the number of docs that contain ki.
Use this initial guess to retrieve an initial ranking, then improve upon it.

Improving the Initial Ranking
Let V be the set of docs initially retrieved and Vi the subset of retrieved docs that contain ki. Re-evaluate the estimates as
P(ki|R) = Vi / V and P(ki|¬R) = (ni - Vi) / (N - V),
and repeat recursively.

Improving the Initial Ranking
To avoid problems when V = 1 or Vi = 0, smooth the estimates:
P(ki|R) = (Vi + 0.5) / (V + 1), P(ki|¬R) = (ni - Vi + 0.5) / (N - V + 1);
alternatively,
P(ki|R) = (Vi + ni/N) / (V + 1), P(ki|¬R) = (ni - Vi + ni/N) / (N - V + 1).
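A minimal sketch (not from the slides) of this ranking with the smoothed estimates, assuming binary term incidence and made-up values for N, ni, V, and Vi:

```python
# Hypothetical illustration of the probabilistic (binary independence) ranking
# with the smoothed estimates above: p = P(ki|R), u = P(ki|not R).
import math

def term_weight(n_i: int, V_i: int, N: int, V: int) -> float:
    p = (V_i + 0.5) / (V + 1)             # smoothed P(ki | R)
    u = (n_i - V_i + 0.5) / (N - V + 1)   # smoothed P(ki | not R)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

def sim(doc_terms: set[str], query_terms: set[str],
        stats: dict[str, tuple[int, int]], N: int, V: int) -> float:
    """stats[t] = (n_t, V_t): docs containing t, and inspected relevant docs containing t."""
    return sum(term_weight(*stats[t], N, V) for t in query_terms & doc_terms)

# Toy numbers (hypothetical): N = 100 docs in the collection, V = 10 inspected relevant docs.
stats = {"retrieval": (20, 6), "model": (40, 5)}
print(sim({"retrieval", "model", "language"}, {"retrieval", "model"}, stats, N=100, V=10))
```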
Discussion
Advantages: documents are ranked in decreasing order of their probability of relevance.
Disadvantages: the initial estimates for P(ki|R) have to be guessed, and the method does not take tf and idf factors into account.

Outline: Boolean Model (BM), Vector Space Model (VSM), Probabilistic Model (PM), Language Model (LM).

Document Representation
Bag-of-words, bag-of-facts, bag-of-sentences, bag-of-nets.
Document = bag of words, or document = bag of sentences, where a sentence is a word sequence, e.g. p(南京市长) and p(江大桥|南京市长) for the ambiguous string 南京市长江大桥 ("Nanjing Yangtze River Bridge" vs. "Nanjing mayor Jiang Daqiao"), or p(中国大学人民) for a scrambled 中国人民大学.

What is a LM?
A "language" is a probability distribution over its alphabet; the distribution reflects how likely any symbol sequence is to be a sentence (or any other linguistic unit) of that language. This probability distribution is called a language model. Given a language, we can estimate the probability that a given "sentence" (symbol string) occurs. For example, for English we would expect p1(a quick brown dog) > p2(dog brown a quick) > p3(brown dog 棕熊) > p4(棕熊). If the model gives p1 = p2 (word order is ignored), it is called a first-order language model; otherwise it is a higher-order model.

Basic Notation
M: the language we are trying to model; it can be thought of as a source.
s: an observation (a string of tokens).
P(s|M): the probability of observation s in M, that is, the probability of obtaining s by random sampling from M.

Basic Notation
Let S = s1 s2 ... sn be any sentence. By the chain rule, P(S) = P(s1) P(s2|s1) ... P(sn|s1, s2, ..., sn-1).
Under an n-gram model, P(si|s1, ..., si-1) = P(si|si-n+1, ..., si-1); for n = 1 (unigram), P(si|s1, ..., si-1) = P(si).
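A minimal sketch (not from the slides) of this decomposition for a bigram (n = 2) model, with made-up conditional probabilities:

```python
# Hypothetical illustration of the chain rule under a bigram model:
# P(s1 s2 ... sn) = P(s1) * P(s2|s1) * ... * P(sn|s(n-1)).
import math

bigram = {("<s>", "a"): 0.4, ("a", "quick"): 0.2,
          ("quick", "brown"): 0.5, ("brown", "dog"): 0.3}

def log_prob(sentence: list[str]) -> float:
    prev, logp = "<s>", 0.0
    for w in sentence:
        logp += math.log(bigram.get((prev, w), 1e-6))  # unseen pairs get a tiny floor
        prev = w
    return logp

print(log_prob(["a", "quick", "brown", "dog"]))   # higher than a scrambled order
print(log_prob(["dog", "brown", "a", "quick"]))
```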
How can we use LMs in IR?
Use an LM to model the process of query generation: every document in the collection defines a "language", and P(s|MD) is the probability that the author would write down string s. Now suppose q is the user's query; P(q|MD), the probability of q during random sampling from document D, can be treated as the ranking score of document D in the collection.

Major issues in applying LMs
What kind of language model should we use: unigram or higher-order models? How can we estimate the model parameters: a basic or an advanced model, and which data-smoothing approach?

What kind of model is better?
Unigram model, bigram model, higher-order model.

Unigram LMs
Words are "sampled" independently of each other, so the joint probability decomposes into a product of marginals: P(xyz) = P(x) P(y) P(z), e.g. P("brown dog") = P("dog") P("brown"). Probabilities are estimated by simple counting.

Higher-order Models
n-gram: condition on the preceding words. Cache: condition on a window.
Grammar: condition on a parse tree. Are these useful? Parameter estimation is very expensive!

Comparison
Song and Croft reported that mixing a unigram and a bigram language model performs about 8% better than using the unigram model alone. However, Victor Lavrenko pointed out that the higher-order models used by Song and Croft do not consistently beat the unigram-only model. David also reported that a mixture of unigram and bigram models outperforms the unigram model. Some studies suggest that word order has little impact on retrieval results.

Major issues in applying LMs
What kind of language model should we use: unigram or higher-order models? How can we estimate model parameters: basic or advanced models, and which data-smoothing approach?

Estimation of parameters
Given a string of text S (S = Q or D), estimate its LM, Ms.
Basic LMs: maximum-likelihood estimation; the zero-frequency problem; discounting techniques; interpolation methods.

Maximum-likelihood estimation
Let V be the vocabulary of M, Q = q1 q2 ... qm a query with qi in V, and S = d1 d2 ... dn a document. Let Ms be the language model of S.
P(Q|Ms) is called the query likelihood. P(Ms|Q) = P(Q|Ms) P(Ms) / P(Q), which is proportional to P(Q|Ms) P(Ms), can be treated as the ranking score of document S. So we need to estimate P(Q|Ms) and P(Ms).
Maximum-likelihood estimation
Two ways to estimate P(Q|Ms): the multivariate Bernoulli model and the multinomial model.
The Bernoulli model only considers whether a word occurs in the query, not how many times. The query is viewed as the outcome of |V| independent Bernoulli trials:
P(Q|Ms) = [prod over w in Q: P(w|Ms)] * [prod over w not in Q: (1 - P(w|Ms))].
The multinomial model views the query as the outcome of a sequence of multinomial trials, so it takes the number of occurrences of each word into account:
P(Q|Ms) = prod over i of P(qi|Ms) = prod over w in Q of P(w|Ms)^#(w,Q).
Both approaches reduce to estimating P(w|Ms); that is, the IR problem becomes the problem of estimating a document language model, so results from language-modeling research can be reused.

Maximum-likelihood estimation
The simplest approach is maximum-likelihood estimation: count the relative frequencies of words in S, i.e. P(w|Ms) = #(w, S) / |S|.
The zero-frequency problem (caused by data sparseness): if some event does not occur in S, its estimated probability is 0. This is not correct, and we need to avoid it. A tiny numeric example follows.
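A minimal sketch (not from the slides) of the maximum-likelihood estimate and the zero-frequency problem, on a made-up one-sentence document:

```python
# Hypothetical illustration: P(w|Ms) = #(w,S) / |S| assigns probability 0 to any
# query word that never occurs in S, which zeroes out the whole query likelihood.
from collections import Counter

S = "information retrieval with language models".split()
counts, length = Counter(S), len(S)

def p_ml(w: str) -> float:
    return counts[w] / length

query = ["language", "models", "smoothing"]
probs = [p_ml(w) for w in query]
print(probs)                     # [0.2, 0.2, 0.0] -> "smoothing" is unseen
likelihood = 1.0
for p in probs:
    likelihood *= p
print(likelihood)                # 0.0, even though two of three terms match well
```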
Discounting Methods
Laplace correction (add-one approach): add 1 to every count and normalize, P(w|Ms) = (#(w, S) + 1) / (|S| + |V|). This is problematic for large vocabularies (when |V| is too large).
Ref. Chen SF and Goodman JT, An empirical study of smoothing techniques for language modeling, Proc. 34th Annual Meeting of the ACL, 1996.
Smoothing Methods
Additive smoothing methods, the Jelinek-Mercer method, the Dirichlet method.

Additive smoothing methods
P(w|Ms) = (#(w, S) + c) / (|S| + c|V|). When c = 1 this is the Laplace (add-one) smoothing method.

Jelinek-Mercer method
Discounting methods treat all unseen words alike, but in reality they differ; we can bring in background knowledge, for example knowledge of the English language, through a reference model:
P(w|Ms) = c * PML(w|Ms) + (1 - c) * P(w),
where PML(w|Ms) is the conditional (maximum-likelihood) probability and P(w) = P(w|REF) is the prior probability under the reference (background) model. Set c to a constant, independent of document and query. A small scoring sketch with this interpolation follows.
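A minimal sketch (not from the slides) of unigram query-likelihood ranking with Jelinek-Mercer interpolation against a background collection model, using hypothetical documents and an assumed interpolation weight:

```python
# Hypothetical illustration: score(Q, D) = sum_w log( c*P_ML(w|D) + (1-c)*P(w|REF) ).
import math
from collections import Counter

docs = {"d1": "language models for information retrieval".split(),
        "d2": "boolean and vector space models".split()}
collection = [w for words in docs.values() for w in words]
ref, ref_len = Counter(collection), len(collection)
C = 0.7   # interpolation weight c (assumed value)

def score(query: list[str], doc: list[str]) -> float:
    tf, dl = Counter(doc), len(doc)
    s = 0.0
    for w in query:
        p_ml = tf[w] / dl            # P_ML(w | Ms)
        p_ref = ref[w] / ref_len     # P(w | REF), background model
        s += math.log(C * p_ml + (1 - C) * p_ref)
    return s

query = "retrieval models".split()
for name, words in docs.items():
    print(name, round(score(query, words), 3))
```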
Impact of smoothing on retrieval performance
Zhai CX, Lafferty J, A study of smoothing methods for language models applied to ad hoc information retrieval, ACM SIGIR 2001.
Zhai CX, Lafferty J, A study of smoothing methods for language models applied to information retrieval, ACM TOIS 22(2):179-214.
Smoothing plays two roles: first, estimation, solving the zero-probability problem; second, query modeling, removing or reducing the influence of noise.

Translation Models
Basic LMs do not address word synonymy. Use P(q|M) = sum over w of P(w|M) * P(q|w), where P(q|w) captures the relationship between q and w: if q and w are near-synonyms, this value is relatively large. P(q|w) can be computed from word co-occurrence, shared stems, dictionaries, and so on; this is the key to the approach. P(w|M) is simply the probability of w under the language model.
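A minimal sketch (not from the slides) of translation-model scoring for a single query term, with a hand-made P(q|w) table standing in for co-occurrence or thesaurus estimates:

```python
# Hypothetical illustration: P(q|M_D) = sum_w P(w|M_D) * P(q|w),
# so a document can match a query term through related words it does contain.
from collections import Counter

doc = "car automobile engine repair".split()
p_w, doc_len = Counter(doc), len(doc)

# Hand-made translation table P(q|w); in practice it would come from
# co-occurrence statistics, stemming, or a dictionary/thesaurus.
p_q_given_w = {("auto", "automobile"): 0.6, ("auto", "car"): 0.5, ("auto", "auto"): 1.0}

def p_translation(q: str) -> float:
    return sum((p_w[w] / doc_len) * p_q_given_w.get((q, w), 0.0) for w in p_w)

print(p_translation("auto"))   # > 0 although "auto" itself never occurs in the doc
```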
LM Tools
LEMUR (www.cs.cmu.edu/lemur): a CMU/UMass joint project, written in C++, with good documentation and forum-based support; supports ad-hoc IR, clustering, and Q/A systems; ML + smoothing.
YARI: ad-hoc IR, cross-language retrieval, classification; ML + smoothing.

Other applications of LM
Topic detection and tracking (treat q as a topic description), classification/filtering, cross-language retrieval, multi-media retrieval.
References
Ponte JM, Croft WB, A language modeling approach to information retrieval, ACM SIGIR 1998, pp. 275-281.
Ponte JM, A Language Modeling Approach to Information Retrieval, PhD dissertation, UMass, 1998.

Bag-of-nets
What if the concepts in a text are expressed with an ontology, i.e. the concepts extracted from the text are placed against the background of a domain ontology and form a network of concepts? Could a Bayesian network be used? The key is how to understand the relations between words: are they causal? For example, hypernym/hyponym relations? Association relations?

Other retrieval models not based on content
Collaboration-based models, link-analysis-based models, association-based models; they are usually used together with content-based models.

Collaborative Recommendation
raj denotes the score of item j rated by an active user a; if user a has not rated item j, raj = 0. m is the total number of users and n the total number of items.

Collaborative recommendation model
For a given user a and item j, predict paj with the standard memory-based formula:
paj = mean(ra) + k * sum over the users i who are similar to user a and have rated item j of w(a, i) * (rij - mean(ri)),
where w(a, i) is the similarity weight between user a and user i, mean(ri) is user i's average rating, and k is a normalizing factor chosen so that the absolute values of the weights sum to unity. The summation runs over the users who are similar to user a and have rated item j. A small sketch of this prediction follows.
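A minimal sketch (not from the slides) of this memory-based prediction, assuming a small rating matrix and hand-made similarity weights:

```python
# Hypothetical illustration of collaborative prediction:
# p_aj = mean(r_a) + k * sum_i w(a,i) * (r_ij - mean(r_i)),  k = 1 / sum_i |w(a,i)|.
ratings = {            # user -> {item: rating}; an absent item means "not rated"
    "alice": {"i1": 5, "i2": 3},
    "bob":   {"i1": 4, "i2": 2, "i3": 4},
    "carol": {"i1": 2, "i2": 5, "i3": 1},
}
weights = {"bob": 0.9, "carol": -0.4}    # hand-made similarities w(alice, i)

def mean(r: dict) -> float:
    return sum(r.values()) / len(r)

def predict(active: str, item: str) -> float:
    others = [u for u in ratings if u != active and item in ratings[u]]
    k = 1.0 / sum(abs(weights[u]) for u in others)
    delta = sum(weights[u] * (ratings[u][item] - mean(ratings[u])) for u in others)
    return mean(ratings[active]) + k * delta

print(round(predict("alice", "i3"), 2))   # prediction for an item alice has not rated
```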
Main problems with the algorithm
Cold start, sparsity, and high dimensionality.

Classification-based collaborative filtering
Addresses the cold-start problem. Basic idea: (1) partition the rating matrix according to the semantic categories of the resources; (2) perform collaborative filtering on the partitioned sub-matrices; (3) generate the prediction results.

Classification-based collaborative filtering
Basic idea: (1) assign each resource to one or several categories; (2) decompose the user-resource rating matrix into per-category sub-matrices; (3) prune each sub-matrix by removing the users who have not rated resources of that category.

Classification-based collaborative filtering (continued)
(4) Compute user similarities within a category from the pruned sub-matrix, obtaining each user's nearest neighbors. (5) Compute the user's degree of interest in the resources of a particular category. (6) Combine the user's degrees of interest across several categories to produce the final recommendation.

Clustering-based collaborative filtering
Basic idea: (1) partition the matrix using clustering algorithms such as sparse-matrix clustering or K-Means; (2) perform collaborative filtering on the partitioned sub-matrices; (3) generate the prediction results.