Notes on Learning Bayesian Networks in Weka

Trying to understand Bayesian network learning in Weka by reading only the few relevant packages is nearly impossible; you end up missing the forest for the trees. Grinding through the code by itself did not work either, so a more flexible approach was needed. On one hand, I worked through other people's blog write-ups to gradually map out how the relevant Weka classes are organized and what their functions do. On the other hand, I downloaded a Chinese-annotated analysis of the Weka source code, which makes it possible to understand the important BN-related functions and operations from a wider angle.

Pick any class at random, say Instance and Instances, or Evaluation: each runs to several hundred or over a thousand lines, and just reading one through is a slog. Many classes use inheritance, interfaces are used in many places, and abstract classes only rarely. The classes are heavily interrelated: a casual reference here, an appearance inside some function there, and it all counts as a dependency. I had assumed a UML diagram would sort this out, but once I started dragging nodes around, the boxes and lines kept multiplying, the diagram's usefulness dropped rapidly, and in the end it still could not explain much.

I started with the theory. The basic principles of naive Bayes and Bayesian networks are actually not complicated, and a small network can be worked out by hand; it is a bit like matrices, where small ones can be added and multiplied on paper but large ones force you to turn to a computer and write code. Things that are straightforward mathematically need quite a bit of supporting machinery once they are put into a program. In fact, even the probability theory itself has to be handled carefully, since small slips lead to all kinds of errors. On the software side, the flow is: read in the ARFF data, build an ADTree, then train the classifier, and finally run the tests.
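As a point of reference for that flow, here is a minimal sketch of training and testing a BayesNet classifier through the standard Weka Java API. The class name, the file path and the 10-fold cross-validation setup are just assumptions for illustration, not anything prescribed by Weka:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.BayesNet;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BayesNetDemo {
        public static void main(String[] args) throws Exception {
            // load an ARFF file (hypothetical path) and mark the last attribute as the class
            Instances data = DataSource.read("weather.nominal.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // build the classifier: structure search first, then CPT estimation
            BayesNet bn = new BayesNet();
            bn.buildClassifier(data);

            // evaluate with 10-fold cross-validation; the detail strings include
            // precision, recall and the confusion matrix mentioned below
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new BayesNet(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toMatrixString());
        }
    }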

On ADTrees, Wikipedia gives this description: "An alternating decision tree (ADTree) is a machine learning method for classification. It generalizes decision trees and has connections to boosting. An alternating decision tree consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by an ADTree by following all paths for which all decision nodes are true and summing any prediction nodes that are traversed. This is different from binary classification trees such as CART (Classification and regression tree) or C4.5 in which an instance follows only one path through the tree." The example given at http://en.wikipedia.org/wiki/ADTree is worth working through carefully.

When Weka evaluates test instances there are also many terms and metrics to get familiar with, such as recall, precision, the confusion matrix, and others. The scoring functions are grouped under evaluation in the software. The CPT, i.e. the conditional probability table, is likewise key to building the network. One common way to put it is: Bayes = Bs (the DAG) + Bp (the CPT).
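To make Bs and Bp concrete: the DAG (Bs) fixes each node's parent set pa(.), the CPTs (Bp) supply the conditional distributions, and the network represents the joint distribution over the class node y and the attribute nodes a_1, ..., a_m as the usual product (this is the standard definition, not anything Weka-specific); classification then amounts to picking the class value with the largest joint probability:

\[
P(y, a_1, \dots, a_m) \;=\; \prod_{x \in \{y,\,a_1,\dots,a_m\}} P\bigl(x \mid \mathrm{pa}(x)\bigr),
\qquad
\hat{y} \;=\; \arg\max_{y}\; P(y, a_1, \dots, a_m).
\]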

Weka's BayesNet has two requirements on the data: the attributes must be discretized, and no values may be missing (null). The overall learning process is to construct the DAG first and then learn the CPTs.
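For completeness, this is one way those two requirements can be satisfied explicitly before training, using Weka's standard unsupervised filters. It is only a sketch: the class and method names are my own, and it assumes the loaded Instances still contain numeric attributes or missing values.

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class Preprocess {
        // returns a copy of the data with missing values filled in and numeric
        // attributes binned into nominal ones, ready for BayesNet
        public static Instances prepare(Instances data) throws Exception {
            ReplaceMissingValues rmv = new ReplaceMissingValues();
            rmv.setInputFormat(data);
            Instances noMissing = Filter.useFilter(data, rmv);

            Discretize disc = new Discretize();
            disc.setInputFormat(noMissing);
            return Filter.useFilter(noMissing, disc);
        }
    }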

Some notes from the code analysis. In buildClassifier the important lines are:

    // build the network structure
    initStructure();

    // build the network structure
    buildStructure();

    // build the set of CPTs
    estimateCPTs();

initStructure initializes the network structure, buildStructure constructs it, and estimateCPTs computes the conditional probability tables.

    /**
     * Init structure initializes the structure to an empty graph
     * or a Naive Bayes graph (depending on the -N flag).
     */
    public void initStructure() throws Exception {
        // reserve memory
        m_ParentSets = new ParentSet[m_Instances.numAttributes()];
        for (int iAttribute = 0; iAttribute < m_Instances.numAttributes(); iAttribute++) {
            m_ParentSets[iAttribute] = new ParentSet(m_Instances.numAttributes());
        }
    } // initStructure

m_ParentSets[iAttribute] records the parents of the i-th attribute (iAttribute). The ParentSet constructor is:

    public ParentSet(int nMaxNrOfParents) {
        m_nParents = new int[nMaxNrOfParents];
        m_nNrOfParents = 0;
        m_nCardinalityOfParents = 1;
    } // ParentSet

It does not do much; the result is simply an empty graph. Next comes buildStructure, which calls buildStructure in SearchAlgorithm:

    /**
     * buildStructure determines the network structure/graph of the network.
     * The default behavior is creating a network where all nodes have
     * the first node as its parent (i.e., a BayesNet that behaves like
     * a naive Bayes classifier). This method can be overridden by derived
     * classes to restrict the class of network structures that are acceptable.
     */
    public void buildStructure(BayesNet bayesNet, Instances instances) throws Exception {
        if (m_bInitAsNaiveBayes) {
            int iClass = instances.classIndex();
            // initialize parent sets to have arrow from classifier node to
            // each of the other nodes
            for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
                if (iAttribute != iClass) {
                    bayesNet.getParentSet(iAttribute).addParent(iClass, instances);
                }
            }
        }
        search(bayesNet, instances);
        if (m_bMarkovBlanketClassifier) {
            doMarkovBlanketCorrection(bayesNet, instances);
        }
    } // buildStructure

Here the code checks whether to initialize as naive Bayes. If not, the graph stays empty; if it is initialized as naive Bayes, the class attribute is added as a parent of every other attribute. The code of addParent is:

    public void addParent(int nParent, Instances _Instances) {
        if (m_nNrOfParents == 10) {
            // reserve more memory
            int[] nParents = new int[50];
            for (int i = 0; i < m_nNrOfParents; i++) {
                nParents[i] = m_nParents[i];
            }
            m_nParents = nParents;
        }
        m_nParents[m_nNrOfParents] = nParent;
        m_nNrOfParents++;
        m_nCardinalityOfParents *= _Instances.attribute(nParent).numValues();
    } // AddParent

The leading if is just code that reserves more memory; the rest records which attribute becomes a parent. m_nNrOfParents is the number of parents, and m_nCardinalityOfParents is the product of the numbers of values the parent attributes can take (with parents taking 3 and 2 values, for example, it is 6, i.e. the number of possible parent configurations).

There are many implementations of the search function; here is the K2 implementation:

    int[] nOrder = new int[instances.numAttributes()];
    nOrder[0] = instances.classIndex();
    int nAttribute = 0;
    for (int iOrder = 1; iOrder < instances.numAttributes(); iOrder++) {
        if (nAttribute == instances.classIndex()) {
            nAttribute++;
        }
        nOrder[iOrder] = nAttribute++;
    }

In nOrder the class attribute gets index 0; apart from that, the attributes keep their original order (for example, with five attributes and classIndex 2, nOrder is {2, 0, 1, 3, 4}).

    // determine base scores
    double[] fBaseScores = new double[instances.numAttributes()];
    for (int iOrder = 0; iOrder < instances.numAttributes(); iOrder++) {
        int iAttribute = nOrder[iOrder];
        fBaseScores[iAttribute] = calcNodeScore(iAttribute);
    }

The base scores are computed by calling calcNodeScore:

    public double calcNodeScore(int nNode) {
        if (m_BayesNet.getUseADTree() && m_BayesNet.getADTree() != null) {
            return calcNodeScoreADTree(nNode);
        } else {
            return calcNodeScorePlain(nNode);
        }
    }

Leaving the ADTree variant aside for now, look at calcNodeScorePlain:

    // estimate distributions
    Enumeration enumInsts = instances.enumerateInstances();
    while (enumInsts.hasMoreElements()) {
        Instance instance = (Instance) enumInsts.nextElement();
        // updateClassifier;
        double iCPT = 0;
        for (int iParent = 0; iParent < oParentSet.getNrOfParents(); iParent++) {
            int nParent = oParentSet.getParent(iParent);
            iCPT = iCPT * instances.attribute(nParent).numValues() + instance.value(nParent);
        }
        nCounts[numValues * ((int) iCPT) + (int) instance.value(nNode)]++;
    }

Here nCounts is the N_ijk mentioned on page 4 of the paper Bayesian Network Classifiers in Weka: the three dimensions i, j, k are flattened into a single array, with the node's own value, instance.value(nNode), as the last (fastest-varying) index.
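The line iCPT = iCPT * numValues + value is a mixed-radix encoding of the parent configuration. As a made-up example (the parent attributes and value counts below are purely illustrative, not taken from Weka):

    public class ParentIndexDemo {
        public static void main(String[] args) {
            // Suppose node X has parents A (3 values) and B (2 values), visited in
            // that order, and the current instance has A = 2 and B = 1.
            int iCPT = 0;
            iCPT = iCPT * 3 + 2;  // after folding in A: 2
            iCPT = iCPT * 2 + 1;  // after folding in B: 2 * 2 + 1 = 5
            // iCPT ranges over 0..5, i.e. the 3 * 2 = 6 parent configurations
            // (exactly m_nCardinalityOfParents). calcNodeScorePlain then increments
            // nCounts[numValues * iCPT + k], where k is X's own value.
            System.out.println("parent configuration index = " + iCPT);  // prints 5
        }
    }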

At the end of calcNodeScorePlain, the function calcScoreOfCounts is called:

    for (int iParent = 0; iParent < nCardinality; iParent++) {
        switch (m_nScoreType) {
        case (Scoreable.BAYES): {
            double nSumOfCounts = 0;
            for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
                if (m_fAlpha + nCounts[iParent * numValues + iSymbol] != 0) {
                    fLogScore += Statistics.lnGamma(m_fAlpha + nCounts[iParent * numValues + iSymbol]);
                    nSumOfCounts += m_fAlpha + nCounts[iParent * numValues + iSymbol];
                }
            }
            if (nSumOfCounts != 0) {
                fLogScore -= Statistics.lnGamma(nSumOfCounts);
            }
            if (m_fAlpha != 0) {
                fLogScore -= numValues * Statistics.lnGamma(m_fAlpha);
                fLogScore += Statistics.lnGamma(numValues * m_fAlpha);
            }
        }
            break;

Compare this with the Bayesian metric on page 6 of Bayesian Network Classifiers in Weka: the first for loop accumulates the log Gamma(N'_ijk + N_ijk) terms (and collects N'_ij + N_ij in nSumOfCounts), the following subtraction contributes log Gamma(N'_ij + N_ij), and the final block removes the log Gamma(N'_ijk) terms and adds log Gamma(N'_ij).
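For reference, the Bayesian metric being accumulated has the following form (the standard Bayesian/K2-style score as described in the paper; in Weka's implementation the prior counts are uniform, so N'_ijk = alpha and N'_ij = r_i * alpha, where r_i is the number of values of node i and q_i its number of parent configurations):

\[
Q_{\mathrm{Bayes}}(B_S, D) \;=\; P(B_S)\,
\prod_{i=1}^{n}\ \prod_{j=1}^{q_i}
\frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})}
\prod_{k=1}^{r_i}
\frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})},
\qquad
N_{ij} = \sum_{k=1}^{r_i} N_{ijk},
\]

and calcScoreOfCounts computes its logarithm for a single node i, summed over the parent configurations j (the loop variable iParent).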

The remaining score types share one branch:

        case (Scoreable.MDL):
        case (Scoreable.AIC):
        case (Scoreable.ENTROPY): {
            double nSumOfCounts = 0;
            for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
                nSumOfCounts += nCounts[iParent * numValues + iSymbol];
            }
            for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
                if (nCounts[iParent * numValues + iSymbol] > 0) {
                    fLogScore += nCounts[iParent * numValues + iSymbol]
                        * Math.log(nCounts[iParent * numValues + iSymbol] / nSumOfCounts);
                }
            }
        }
            break;

This corresponds to formula (2) on page 5 of Bayesian Network Classifiers in Weka; the only difference is that the factor N is missing, because it cancels out. After the loop over parent configurations the penalty terms are applied:

    switch (m_nScoreType) {
    case (Scoreable.MDL):
        fLogScore -= 0.5 * nCardinality * (numValues - 1) * Math.log(instances.numInstances());
        break;
    case (Scoreable.AIC):
        fLogScore -= nCardinality * (numValues - 1);
        break;
    }

In the formulas, K = nCardinality * (numValues - 1) and N = instances.numInstances(); see formula (3), and formulas (5) for how MDL and AIC are computed.
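Spelled out in the paper's notation (this is a reading of the code above rather than a quote from the paper), the quantities for the whole network are

\[
H(B_S, D) \;=\; -\sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i} N_{ijk}\,\log\frac{N_{ijk}}{N_{ij}},
\qquad
K \;=\; \sum_{i=1}^{n} (r_i - 1)\, q_i,
\]
\[
\text{score}_{\mathrm{MDL}} \;=\; -H(B_S,D) \;-\; \tfrac{K}{2}\,\log N,
\qquad
\text{score}_{\mathrm{AIC}} \;=\; -H(B_S,D) \;-\; K,
\]

where q_i is the number of parent configurations of node i (nCardinality for that node), r_i its number of values (numValues), and N the number of training instances. The code accumulates the per-node contribution to -H in fLogScore and then subtracts the penalty, so maximizing fLogScore matches (up to sign) minimizing the MDL and AIC metrics of the paper.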

Back to K2: the greedy search restricted by the node ordering is

    // K2 algorithm: greedy search restricted by ordering
    for (int iOrder = 1; iOrder < instances.numAttributes(); iOrder++) {
        int iAttribute = nOrder[iOrder];
        double fBestScore = fBaseScores[iAttribute];

        boolean bProgress = (bayesNet.getParentSet(iAttribute).getNrOfParents() < getMaxNrOfParents());
        while (bProgress) {
            int nBestAttribute = -1;
            for (int iOrder2 = 0; iOrder2 < iOrder; iOrder2++) {
                int iAttribute2 = nOrder[iOrder2];
                double fScore = calcScoreWithExtraParent(iAttribute, iAttribute2);
                if (fScore > fBestScore) {
                    fBestScore = fScore;
                    nBestAttribute = iAttribute2;
                }
            }
            if (nBestAttribute != -1) {
                bayesNet.getParentSet(iAttribute).addParent(nBestAttribute, instances);
                fBaseScores[iAttribute] = fBestScore;
                bProgress = (bayesNet.getParentSet(iAttribute).getNrOfParents() < getMaxNrOfParents());
            } else {
                bProgress = false;
            }
        }
    }

bProgress checks whether the number of parents of node iAttribute has reached the maximum allowed. The logic is roughly: for each attribute (position iOrder in the ordering), candidate parents are only taken from positions 0 to iOrder - 1, so that no cycles can arise. For each candidate, calcScoreWithExtraParent is called to compute a score; if it is higher than the best score so far, that candidate becomes the best attribute and addParent is called to add it.

    public double calcScoreWithExtraParent(int nNode, int nCandidateParent) {
        ParentSet oParentSet = m_BayesNet.getParentSet(nNode);

        // sanity check: nCandidateParent should not be in parent set already
        if (oParentSet.contains(nCandidateParent)) {
            return -1e100;
        }

        // set up candidate parent
        oParentSet.addParent(nCandidateParent, m_BayesNet.m_Instances);

        // calculate the score
        double logScore = calcNodeScore(nNode);

        // delete temporarily added parent
        oParentSet.deleteLastParent(m_BayesNet.m_Instances);

        return logScore;
    } // CalcScoreWithExtraParent

The name already says what the function does: it adds the candidate node (nCandidateParent) to the parent set (oParentSet), calls calcNodeScore to compute the score (already discussed above), and then deletes the parent it just added, restoring the previous parent set. Next is the estimateCPTs function; here we look at the default, SimpleEstimator:

    /**
     * estimateCPTs estimates the conditional probability tables for the Bayes
     * Net using the network structure.
     */
    public void estimateCPTs(BayesNet bayesNet) throws Exception {
        initCPTs(bayesNet);

        // Compute counts
        Enumeration enumInsts = bayesNet.m_Instances.enumerateInstances();
        while (enumInsts.hasMoreElements()) {
            Instance instance = (Instance) enumInsts.nextElement();
            updateClassifier(bayesNet, instance);
        }
    } // estimateCPTs

initCPTs initializes the CPTs, and the CPTs are then learned incrementally, one instance at a time.

    /**
     * initCPTs reserves space for CPTs and set all counts to zero
     */
    public void initCPTs(BayesNet bayesNet) throws Exception {
        Instances instances = bayesNet.m_Instances;

        // Reserve space for CPTs
        int nMaxParentCardinality = 1;
        for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
            if (bayesNet.getParentSet(iAttribute).getCardinalityOfParents() > nMaxParentCardinality) {
                nMaxParentCardinality = bayesNet.getParentSet(iAttribute).getCardinalityOfParents();
            }
        }

        // Reserve plenty of memory
        bayesNet.m_Distributions = new Estimator[instances.numAttributes()][nMaxParentCardinality];

        // estimate CPTs
        for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
            for (int iParent = 0; iParent < bayesNet.getParentSet(iAttribute).getCardinalityOfParents(); iParent++) {
                bayesNet.m_Distributions[iAttribute][iParent] =
                    new DiscreteEstimatorBayes(instances.attribute(iAttribute).numValues(), m_fAlpha);
            }
        }
    } // initCPTs

The first for loop finds nMaxParentCardinality, i.e. the size needed when allocating m_Distributions; after that, an estimator is initialized for every attribute node and every configuration of its parents.

The per-instance update is updateClassifier:

    public void updateClassifier(BayesNet bayesNet, Instance instance) throws Exception {
        for (int iAttribute = 0; iAttribute < bayesNet.m_Instances.numAttributes(); iAttribute++) {
            double iCPT = 0;
            for (int iParent = 0; iParent < bayesNet.getParentSet(iAttribute).getNrOfParents(); iParent++) {
                int nParent = bayesNet.getParentSet(iAttribute).getParent(iParent);
                iCPT = iCPT * bayesNet.m_Instances.attribute(nParent).numValues() + instance.value(nParent);
            }
            bayesNet.m_Distributions[iAttribute][(int) iCPT]
                .addValue(instance.value(iAttribute), instance.weight());
        }
    } // updateClassifier

The counting works in the same way as in calcNodeScorePlain, except that here a two-dimensional array is used. Next, the DiscreteEstimatorBayes constructor and its addValue function:

    public DiscreteEstimatorBayes(int nSymbols, double fPrior) {
        m_fPrior = fPrior;
        m_nSymbols = nSymbols;
        m_Counts = new double[m_nSymbols];
        for (int iSymbol = 0; iSymbol < m_nSymbols; iSymbol++) {
            m_Counts[iSymbol] = m_fPrior;
        }
        m_SumOfCounts = m_fPrior * (double) m_nSymbols;
    } // DiscreteEstimatorBayes

Nothing special: it initializes m_Counts and m_SumOfCounts, with each count starting at the prior m_fPrior.

    public void addValue(double data, double weight) {
        m_Counts[(int) data] += weight;
        m_SumOfCounts += weight;
    }

Also very simple: the weight of the sample is added onto the corresponding count.
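Taken together, the constructor's prior and addValue's counting mean that each CPT entry ends up as a smoothed relative frequency. Reading it off the code (with alpha = m_fAlpha, r_i the node's number of values, and the N_ijk counts as before), the estimate that getProbability (shown further below) returns is

\[
\hat{P}\bigl(x_i = k \mid \mathrm{pa}(x_i) = j\bigr) \;=\; \frac{N_{ijk} + \alpha}{N_{ij} + r_i\,\alpha},
\]

which for alpha = 1 is the familiar Laplace correction and for alpha = 0 reduces to the plain maximum-likelihood relative frequency.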

Now look at SimpleEstimator (its distributionForInstance):

    for (int iClass = 0; iClass < nNumClasses; iClass++) {
        fProbs[iClass] = 1.0;
    }

I have a small doubt about this code: judging from the comments below it, it apparently used to multiply the probabilities directly and was later changed to sum the logarithms first and exponentiate afterwards. Initializing to 1 was fine for the product form, but when summing logarithms the initialization should really be 0. (If the scores are indeed exponentiated and normalized afterwards, this is harmless: the same constant 1.0 is added to every class score, so it cancels out of the final distribution.)

    for (int iClass = 0; iClass < nNumClasses; iClass++) {
        double logfP = 0;
        for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
            double iCPT = 0;
            for (int iParent = 0; iParent < bayesNet.getParentSet(iAttribute).getNrOfParents(); iParent++) {
                int nParent = bayesNet.getParentSet(iAttribute).getParent(iParent);
                if (nParent == instances.classIndex()) {
                    iCPT = iCPT * nNumClasses + iClass;
                } else {
                    iCPT = iCPT * instances.attribute(nParent).numValues() + instance.value(nParent);
                }
            }
            if (iAttribute == instances.classIndex()) {
                logfP += Math.log(bayesNet.m_Distributions[iAttribute][(int) iCPT].getProbability(iClass));
            } else {
                logfP += Math.log(bayesNet.m_Distributions[iAttribute][(int) iCPT].getProbability(instance.value(iAttribute)));
            }
        }
        fProbs[iClass] += logfP;
    }

The innermost loop computes iCPT, i.e. it locates the right position in the second dimension of m_Distributions (the class attribute contributes the candidate class iClass instead of the instance's value); compare formula (1) on page 3. The code of getProbability in DiscreteEstimatorBayes is:

    public double getProbability(double data) {
        if (m_SumOfCounts == 0) {
            // this can only happen if numSymbols = 0 in constructor
            return 0;
        }
        return m_Counts[(int) data] / m_SumOfCounts;
    } // getProbability

i.e. the (prior-smoothed) count for the value divided by the total count, exactly the estimate written out above.
