Experiment: Algorithm Section (实验算法部分.doc)


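Information processing module requirements:
(1) Use natural language processing and data mining techniques to mine the text of crawled websites, forums, blogs, and microblogs, accurately extracting from this web content the elements of the events users care about: time, place, subject, action, and object.
(2) Analyze users' attitudes toward these event elements to build user-specific attention and attitude models. Mining a large volume of different kinds of web content yields a library of differentiated models; user questionnaires can supplement it with further differentiated attention models.

1. Natural language processing toolkit: fudanNLP

fudanNLP is a toolkit developed mainly for Chinese natural language processing. It also includes the machine learning algorithms and datasets used to carry out these tasks. The toolkit and its bundled datasets are released under the LGPL 3.0 license, and it is written in Java. Its main features are:
1. Text classification; news clustering.
2. Chinese word segmentation; part-of-speech tagging; named entity recognition; keyword extraction; dependency parsing; temporal phrase recognition.
3. Structured learning; online learning; hierarchical classification; clustering; exact inference.

Keyword extraction example. The input is a news sentence reporting that the vehicles from the Yongwen line railway accident were cleared from the scene after nearly 24 hours of work and that the buried locomotive of train D301 was dug out and hauled away. InputStr2 is the same sentence with some word boundaries pre-marked with "|".

InputStr1 = 甬温线特别重大铁路交通事故车辆经过近24小时的清理工作，26日深夜已经全部移出事故现场，之前埋下的D301次动车车头被挖出运走;
InputStr2 = 甬温线|特别|重大|铁路交通事故车辆经过近24小时的清理工作，26日深夜已经全部移出事故现场，之前埋下的D301次动车车头被挖出运走;

Top 10 extracted keywords:
Output1 = 甬温线=100, 运走=100, 事故=52, 工作=41, 深夜=36, 清理=36, 全部=33, 小时=30, 移出=30, 车辆=26;

    import java.util.ArrayList;
    import java.util.Map;
    import org.fnlp.app.keyword.AbstractExtractor;
    import org.fnlp.app.keyword.WordExtract;
    // import [...].tag.CWSTagger;   (package prefix is truncated in the source)
    import edu.fudan.nlp.corpus.StopWords;

    public class GetKeywords {
        // Returns the top keywordsNumber keywords of the given news text.
        public ArrayList<String> GetKeyword(String News, int keywordsNumber) throws Exception {
            ArrayList<String> keywords = new ArrayList<String>();
            StopWords sw = new StopWords("models/stopwords");   // stop-word list
            CWSTagger seg = new CWSTagger("models/seg.m");      // Chinese word segmenter model
            AbstractExtractor key = new WordExtract(seg, sw);
            Map<?, ?> ans = key.extract(News, keywordsNumber);
            for (Map.Entry<?, ?> entry : ans.entrySet()) {
                String keymap = entry.getKey().toString();      // the keyword
                String value = entry.getValue().toString();     // its score
                keywords.add(keymap);
                System.out.println("key" + keymap + "value" + value);
            }
            return keywords;
        }
    }

The output looks like this: [output listing not preserved]

2. Classifying text after keyword extraction

Step 1: preprocess the documents. Scan all training documents under the path of the text dataset (text documents are usually organized into one directory per category) and collect the distinct words.
Step 2: build the term-frequency matrix. After preprocessing, each article becomes a set of words; a word is also called a feature or attribute. Each document is treated as a word vector whose dimension is the number of distinct words, and the vocabulary can contain tens of thousands of distinct words.
Step 3: construct the text classifier. The term-frequency matrix is the basis for modeling: a classifier is built on top of it according to the chosen algorithm, and the main task is to compute the weights of the word vector as that algorithm requires.

Well-known text classification algorithms include the support vector machine (Support Vector Machine, SVM), k-nearest neighbors (K-Nearest Neighbour, KNN), naive Bayes (Naive Bayes, NB), neural networks (Neural Network, NNet), and linear least squares fit (Linear Least Squares Fit, LLSF). In this experiment our group plans to use the neural network method. The source includes only the opening fragment of its code:

    package com.mfsoft.ai.algorithm.imp;

    public class RbfNet extends Object {
        // Array brackets were stripped in the source; dimensions are inferred from usage.
        int inNum;        // number of input nodes
        int hideNum;      // number of hidden nodes
        int outNum;       // number of output nodes
        double[][] c;     // centers
        double[] d;       // distance (spread)
        int epochs;
        double[] x;       // input vector
        double[] x1;      // hidden node states
        double[] x2;      // output node states
        double[] o1;
        double[] o2;
        double[][] w;     // hidden node weights
        double[][] w1;    // output node weights
        double rate_w;    // weight learning rate (input layer to hidden layer)
        double rate_w1;   // weight learning rate (hidden layer to output layer)
        double rate_b1;   // hidden layer threshold learning rate
        double rate_b2;   // output layer threshold learning rate
        double[] b1;      // hidden node thresholds
        double[] b2;      // output node thresholds
        double[] pp;
        double[] qq;
        double[] yd;
        double e;
        double in_rate;   // input normalization scale factor

        public RbfNet(int inNum, int hideNum, int outNum, double[][] p) {
            in_rate = 1.0; // input normalization factor
            /*
            double pmax = 0.0;
            for (int isamp = 0; isamp < p.length; isamp++) {
                for (int i = 0; i < inNum; i++) {
                    if (Math.abs(p[isamp][i]) > pmax) pmax = Math.abs(p[isamp][i]);
                }
            } // end for isamp
            in_rate = pmax;
            for (int isamp = 0; isamp < p.length; isamp++) {
                for (int i = 0; i < inNum; i++) {
                    p[isamp][i] = p[isamp][i] / in_rate;
                }
            } // end for isamp
            */
            // ... the remainder of the class is not included in the source
        }
    }

3. Text clustering

Document clustering can serve as a preprocessing step for natural language processing applications such as multi-document automatic summarization. Implementations include approaches based on the Apriori algorithm.

The basic idea of the Apriori algorithm is as follows. First, find all frequent itemsets, that is, itemsets that occur at least as often as a predefined minimum support. Then generate strong association rules from the frequent itemsets; these rules must satisfy both the minimum support and the minimum confidence. The rules are produced from the frequent itemsets found in the first step: all rules containing only items of a given itemset are generated, each with a single item on its right-hand side.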

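The "medium rule" definition is the one adopted here. Once the rules have been generated, only those whose confidence exceeds the user-specified minimum confidence are kept. All frequent itemsets are generated with a recursive, level-by-level procedure. The code is as follows:

    package apr;

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.util.*;

    @SuppressWarnings("unchecked")
    public class Apriori {
        private double minsup = 0.6;  // minimum support
        private double minconf = 0.2; // minimum confidence
        // Use IdentityHashMap: with an ordinary map, rules sharing an equal key would overwrite each other.
        private IdentityHashMap<String, String> ruleMap = new IdentityHashMap<String, String>();
        // Transaction set; can be passed in through the constructor if needed.
        private String[] transSet = { "abc", "abc", "acde", "bcdf", "abcd", "abcdf" };
        private int itemCounts = 0; // size of the candidate 1-itemset, i.e. the number of distinct letters
        private TreeSet<String>[] frequencySet = new TreeSet[40]; // frequent itemsets; [0] holds the frequent 1-itemsets
        private TreeSet<String> maxFrequency = new TreeSet<String>(); // maximal frequent itemsets
        private TreeSet<String> candidate = new TreeSet<String>();    // candidate 1-itemsets
        private TreeSet<String>[] candidateSet = new TreeSet[40];     // candidate itemsets
        private int frequencyIndex;

        public Apriori() {
            maxFrequency = new TreeSet<String>();
            itemCounts = counts(); // initialize the size of the candidate 1-itemset
            // Initialize the other two arrays. Every slot is filled (the source stops at
            // itemCounts) so that the level past the largest itemset is empty, not null.
            for (int i = 0; i < frequencySet.length; i++) {
                frequencySet[i] = new TreeSet<String>();
                candidateSet[i] = new TreeSet<String>();
            }
            candidateSet[0] = candidate;
        }

        public Apriori(String[] transSet) {
            this.transSet = transSet;
            maxFrequency = new TreeSet<String>();
            itemCounts = counts();
            for (int i = 0; i < frequencySet.length; i++) {
                frequencySet[i] = new TreeSet<String>();
                candidateSet[i] = new TreeSet<String>();
            }
            candidateSet[0] = candidate;
        }

        // Collects the distinct items. The inner loop and return are reconstructed (lost in extraction).
        public int counts() {
            String temp1 = null;
            char temp2 = 'a';
            // Walk every transaction string and add its characters; the set removes duplicates.
            for (int i = 0; i < transSet.length; i++) {
                temp1 = transSet[i];
                for (int j = 0; j < temp1.length(); j++) {
                    temp2 = temp1.charAt(j);
                    candidate.add(String.valueOf(temp2));
                }
            }
            return candidate.size();
        }

        // Frequent 1-itemsets. Called from run(); the header and loop are reconstructed (lost in extraction).
        public void item1_gen() {
            Iterator<String> temp = candidateSet[0].iterator();
            while (temp.hasNext()) {
                String temp1 = temp.next();
                if (count_sup(temp1) >= minsup * transSet.length) frequencySet[0].add(temp1);
            }
        }

        // Number of transactions containing every item of x. Body reconstructed (lost in extraction).
        public double count_sup(String x) {
            int temp = 0;
            for (int i = 0; i < transSet.length; i++) {
                for (int j = 0; j < x.length(); j++) {
                    if (transSet[i].indexOf(x.charAt(j)) == -1) break;
                    else if (j == x.length() - 1) temp++;
                }
            }
            return temp;
        }

        // Candidate k-itemsets: extend each frequent (k-1)-itemset with a larger frequent item.
        // Everything before the comparison "c1 >= c2" is reconstructed (lost in extraction).
        public void canditate_gen(int k) {
            String y = "", z = "", m = "";
            char c1 = 'a', c2 = 'a';
            Iterator<String> temp1 = frequencySet[k - 2].iterator();
            Iterator<String> temp2 = frequencySet[0].iterator();
            TreeSet<String> h = new TreeSet<String>();
            while (temp1.hasNext()) {
                y = temp1.next();
                c1 = y.charAt(y.length() - 1);
                while (temp2.hasNext()) {
                    z = temp2.next();
                    c2 = z.charAt(0);
                    if (c1 >= c2) continue;
                    else {
                        m = y + z;
                        h.add(m);
                    }
                }
                temp2 = frequencySet[0].iterator();
            }
            candidateSet[k - 1] = h;
        }

        // Candidate k-itemsets -> frequent k-itemsets
        public void frequent_gen(int k) {
            String s1 = "";
            Iterator<String> ix = candidateSet[k - 1].iterator();
            while (ix.hasNext()) {
                s1 = ix.next();
                if (count_sup(s1) >= (minsup * transSet.length)) frequencySet[k - 1].add(s1);
            }
        }

        public boolean is_frequent_empty(int k) {
            return frequencySet[k - 1].isEmpty();
        }

        // True if every character of s1 occurs in s2.
        public boolean included(String s1, String s2) {
            for (int i = 0; i < s1.length(); i++) {
                if (s2.indexOf(s1.charAt(i)) == -1) return false;
                else if (i == s1.length() - 1) return true;
            }
            return true;
        }

        public void maxfrequent_gen() {
            for (int i = 1; i < frequencyIndex; i++) {
                maxFrequency.addAll(frequencySet[i]);
            }
            // Disabled in the source: drop every itemset contained in a larger frequent itemset.
            // for (i = 0; i < frequencyIndex; i++) {
            //     iterator1 = frequencySet[i].iterator();
            //     while (iterator1.hasNext()) {
            //         temp1 = (String) iterator1.next();
            //         for (j = i + 1; j < frequencyIndex; j++) {
            //             iterator2 = frequencySet[j].iterator();
            //             while (iterator2.hasNext()) {
            //                 temp2 = (String) iterator2.next();
            //                 if (included(temp1, temp2)) maxFrequency.remove(temp1);
            //             }
            //         }
            //     }
            // }
        }

        public void print_maxfrequent() {
            Iterator<String> iterator = maxFrequency.iterator();
            System.out.print("Frequent itemsets used for rules: ");
            while (iterator.hasNext()) {
                System.out.print(toDigit(iterator.next()) + "\t");
            }
            System.out.println();
        }

        public void rulePrint() {
            String x, y;
            double temp = 0;
            Iterator<String> iterator = ruleMap.keySet().iterator();
            StringBuffer sb = new StringBuffer();
            System.out.println("Association rules:");
            while (iterator.hasNext()) {
                x = iterator.next();
                y = ruleMap.get(x);
                temp = count_sup(x + y) / count_sup(x); // confidence of x --> y
                // x = toDigit(x);
                // y = toDigit(y);
                // The separator between x and y was lost in extraction; "-->" is assumed.
                System.out.println(x + "\t-->" + y + "\t" + temp);
                sb.append("  " + x + "\t-->" + y + "\t" + temp + "\t\n");
            }
            try (BufferedWriter bw = new BufferedWriter(new FileWriter("Asr.txt"))) {
                bw.write("Minimum support minsup = " + minsup);
                bw.newLine();
                bw.write("Minimum confidence minconf = " + minconf);
                bw.newLine();
                bw.write("Generated association rules: ");
                bw.newLine();
                bw.write(sb.toString());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        // Enumerates every split of s into a non-empty left side x and right side y.
        // The second inner loop and the confidence test are reconstructed (lost in extraction).
        public void subGen(String s) {
            String x = "", y = "";
            for (int i = 1; i < (1 << s.length()) - 1; i++) {
                for (int j = 0; j < s.length(); j++) {
                    if (((1 << j) & i) != 0) x += s.charAt(j);
                }
                for (int j = 0; j < s.length(); j++) {
                    if (((1 << j) & (~i)) != 0) y += s.charAt(j);
                }
                if (count_sup(x + y) / count_sup(x) >= minconf) ruleMap.put(x, y);
                x = "";
                y = "";
            }
        }

        public void ruleGen() {
            Iterator<String> iterator = maxFrequency.iterator();
            while (iterator.hasNext()) {
                subGen(iterator.next());
            }
        }

        // for test
        public void print1() {
            Iterator<String> temp = candidateSet[0].iterator();
            while (temp.hasNext()) System.out.println(temp.next());
        }

        // for test
        public void print2() {
            Iterator<String> temp = frequencySet[0].iterator();
            while (temp.hasNext()) System.out.println(temp.next());
        }

        // for test
        public void print3() {
            canditate_gen(2); // the source passes 1 here, which would index frequencySet[-1]
            frequent_gen(2);
            Iterator<String> temp = candidateSet[1].iterator();
            Iterator<String> temp1 = frequencySet[1].iterator();
            while (temp.hasNext()) System.out.println("candidate " + temp.next());
            while (temp1.hasNext()) System.out.println("frequent " + temp1.next());
        }

        public void print_canditate() {
            for (int i = 0; i < frequencySet[0].size(); i++) {
                Iterator<String> ix = candidateSet[i].iterator();
                Iterator<String> iy = frequencySet[i].iterator();
                System.out.print("Candidate set " + (i + 1) + ":");
                while (ix.hasNext()) {
                    System.out.print(ix.next() + "\t");
                    // System.out.print(toDigit(ix.next()) + "\t");
                }
                System.out.print("\n" + "Frequent set " + (i + 1) + ":");
                while (iy.hasNext()) {
                    System.out.print(iy.next() + "\t");
                    // System.out.print(toDigit(iy.next()) + "\t");
                }
                System.out.println();
            }
        }

        private String toDigit(String str) {
            if (str != null) {
                StringBuffer temp = new StringBuffer();
                for (int i = 0; i < str.length(); i++) {
                    char c = str.charAt(i);
                    temp.append(((int) c - 65) + " ");
                }
                return temp.toString();
            } else {
                return null;
            }
        }

        public String[] getTrans_set() { return transSet; }

        // The source assigns the parameter to itself; "this." makes the setter take effect.
        public void setTrans_set(String[] transSet) { this.transSet = transSet; }

        public double getMinsup() { return minsup; }

        public void setMinsup(double minsup) { this.minsup = minsup; }

        public double getMinconf() { return minconf; }

        public void setMinconf(double minconf) { this.minconf = minconf; }

        public void run() {
            int k = 1;
            item1_gen();
            do {
                k++;
                canditate_gen(k);
                frequent_gen(k);
            } while (!is_frequent_empty(k));
            frequencyIndex = k - 1;
            print_canditate();
            maxfrequent_gen();
            print_maxfrequent();
            ruleGen();
            rulePrint();
        }

        public static void main(String[] args) {
            Apriori ap = new Apriori();
            ap.run();
        }
    }

After the microblogs have been processed, the data is stored in a database.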
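The minimum support and minimum confidence thresholds used by the Apriori description above can be made concrete with a small worked example. The sketch below is illustrative only: the class name SupportConfidenceSketch and its two helper methods are hypothetical, and it assumes, as the listing does, that each transaction is a string whose characters are the items. It uses the same six sample transactions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original listing.
public class SupportConfidenceSketch {

    /** Number of transactions that contain every item (character) of the itemset. */
    static int supportCount(List<String> transactions, String itemset) {
        int count = 0;
        for (String t : transactions) {
            boolean containsAll = true;
            for (char item : itemset.toCharArray()) {
                if (t.indexOf(item) == -1) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) count++;
        }
        return count;
    }

    /** confidence(lhs --> rhs) = support(lhs and rhs together) / support(lhs). */
    static double confidence(List<String> transactions, String lhs, String rhs) {
        return (double) supportCount(transactions, lhs + rhs)
                / supportCount(transactions, lhs);
    }

    public static void main(String[] args) {
        List<String> transactions =
                Arrays.asList("abc", "abc", "acde", "bcdf", "abcd", "abcdf");
        // "ab" occurs in abc, abc, abcd, abcdf
        System.out.println(supportCount(transactions, "ab"));    // prints 4
        // "d" occurs in 4 transactions, "d" and "a" together in 3
        System.out.println(confidence(transactions, "d", "a"));  // prints 0.75
    }
}
```

With a minimum support of 0.6 over six transactions, an itemset needs a count of at least 3.6, so "ab" (count 4) is frequent, while the rule d --> a has confidence 0.75 and passes a minimum confidence of 0.2.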

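The first two steps of the classification procedure in section 2 (collect the distinct words, then build the term-frequency matrix) can be sketched as follows. This is a minimal illustration, not the report's implementation: the class name TermFrequencySketch and its methods are hypothetical, and the documents are assumed to be already segmented into word lists (in the report's pipeline that would be the job of the fudanNLP segmenter).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper class illustrating steps 1 and 2 of the classification procedure.
public class TermFrequencySketch {

    /** Step 1: the sorted list of distinct words across all documents. */
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> words = new TreeSet<>();
        for (List<String> doc : docs) {
            words.addAll(doc);
        }
        return new ArrayList<>(words);
    }

    /** Step 2: matrix[d][t] = occurrences of vocabulary term t in document d. */
    static int[][] termFrequencyMatrix(List<List<String>> docs, List<String> vocab) {
        int[][] matrix = new int[docs.size()][vocab.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String word : docs.get(d)) {
                int t = vocab.indexOf(word);
                if (t >= 0) {
                    matrix[d][t]++;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("train", "accident", "train"),
                Arrays.asList("accident", "report"));
        List<String> vocab = vocabulary(docs);
        int[][] matrix = termFrequencyMatrix(docs, vocab);
        System.out.println(vocab);                          // [accident, report, train]
        System.out.println(Arrays.deepToString(matrix));    // [[1, 0, 2], [1, 1, 0]]
    }
}
```

Each row is the word vector of one document, and its length equals the vocabulary size. A classifier such as the neural network mentioned in step 3 would take these rows, possibly reweighted, as its input.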