
Research on Filtering Parallel Corpus: A Ranking Model*

Chen Yidong, Shi Xiaodong, Zhou Changle
Department of Computer Science, Xiamen University, Xiamen 361005

* This work was supported by the National 863 Program (No. 2004AA117010) and the National Natural Science Foundation of China (No. 60373080).
Chen Yidong, male, born 1977, PhD candidate and teaching assistant; research interest: machine translation.
Shi Xiaodong, male, born 1966, PhD, professor; research interests: artificial intelligence, computational linguistics, machine translation.
Zhou Changle, male, born 1959, PhD, professor and doctoral supervisor; research interests: artificial intelligence and its applications.

Abstract

Over the past ten years, statistical methods have attracted wide attention in machine translation and have gradually become the mainstream approach in the field. The performance of a statistical machine translation (SMT) system depends on many factors, such as the translation model, the search strategy, and the parallel corpus. In particular, a large, high-quality bilingual parallel corpus is an essential foundation for a good SMT system. Most existing parallel corpora, however, contain errors or noise that seriously degrade SMT performance, and filtering out bad sentence pairs by hand is tedious and time-consuming. This paper studies a ranking model that helps with this problem. The model considers both syntactic and semantic features of sentence pairs, including a language model, length information, and meaning correspondence. Since current SMT systems depend on word alignment, features based on word-alignment information are also included. The experiment reported at the end of the paper shows that the model performs well.

Keywords: parallel corpora; corpus filtering; ranking; statistical machine translation

1. Introduction

Over the past ten years, the use of statistical methods in machine translation has attracted wide attention, new statistical models have kept appearing, and the performance of statistical translation systems has improved greatly. In recent NIST evaluations, statistical machine translation systems have achieved good results, and statistical methods have gradually become the mainstream approach in machine translation research. A large, high-quality bilingual parallel corpus is an essential foundation for building a high-quality SMT system.

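At present, thanks to continued accumulation and the introduction of some automatic methods [1, 2], the sources of parallel corpora have broadened and their scale is large enough to basically meet requirements, but their quality is not high. Most parallel corpora contain many errors. Some come from the original material used to build the corpus, such as spelling mistakes, wrongly written characters, and mistranslations. Others are introduced while the corpus is being built, such as wrong translation pairs caused by paragraph-alignment or sentence-alignment errors. All of these errors reduce the reliability of training and, in turn, the performance of the translation system.

Besides errors, most parallel corpora also contain sentence pairs that cannot contribute anything to current training algorithms. These pairs usually involve idioms or special translation devices. They are not wrong in themselves and are good mutual translations, but today's unsophisticated learning methods not only fail to benefit from them but are actually disturbed by them. To build a high-performance SMT system, such pairs should therefore be excluded as well.

Removing these erroneous or useless sentence pairs by hand is time-consuming and laborious. This paper studies a ranking model that helps with the problem. The rest of the paper is organized as follows: Section 2 gives the basic idea of the ranking model; Section 3 introduces the main feature functions it uses; Section 4 presents the experiment and results; Section 5 concludes.

2. Basic idea

There are many ways to remove such erroneous or useless sentence pairs from a bilingual parallel corpus. Our approach is based on a simple idea [8]: sort the sentence pairs in the corpus according to some criterion so that the better pairs end up at the front of the corpus; afterwards, the pairs near the end can be given priority in manual proofreading, or simply deleted. The criterion for judging a sentence pair should take three factors into account:

• the number of errors the pair contains;
• how well the two sentences translate each other;
• whether the pair can make a useful contribution to training.

To express these criteria in quantifiable form, we introduce concrete indicators, or feature functions. The quality score of a given sentence pair is then a weighted combination of the feature scores, computed by Equation 2.1:

    S(f, e) = \sum_{i} \lambda_i F_i(f, e)    (2.1)

where f and e are the source and target sentences of the given pair, F_i(f, e) gives the score of the i-th feature, and \lambda_i is the weight of that feature.

3. Feature functions and weights

We currently use the following four features.

• Language-model score
The language-model score was used in [3] to measure the fluency of output sentences. A fluent sentence should contain fewer errors, so this score also reflects how many errors a sentence contains. We therefore adopt the language-model score as one of the feature functions.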

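Specifically, for a given sentence pair, the language-model score is computed by Equation 3.1:

    (3.1)  [equation not preserved in this copy]

where the function len(s) returns the length of sentence s and is used mainly to remove the effect of sentence length.

• Length-match score
The length-match score is used in sentence-alignment models for bilingual corpora [4]. It is generally assumed that, for a given language pair, the lengths of the two sentences in a translation pair follow a certain ratio. In our observation, most parallel corpora still contain pairs whose lengths do not match, and most of these pairs contain errors. To exclude them we introduce the length-match score, computed by Equation 3.2:

    (3.2)  [equation not preserved in this copy]

where the parameters a1 and a2 can vary with the language pair under consideration. For Chinese-English parallel corpora, a1 and a2 can be set to 0.5 and 1.2 respectively.

• bPER score
This score measures how well the two sentences of a pair translate each other. The main idea is that the more words in one sentence have a corresponding translation in the other sentence, the better the two sentences translate each other. Under this idea, most pairs that contain errors or idioms receive low scores. The score resembles the position-independent word error rate (PER) proposed in [5]; the difference is that it uses information from both languages, so we call it bPER, where "b" stands for bilingual. Equations 3.3 to 3.5 describe its calculation in detail:

    (3.3) (3.4) (3.5)  [equations not preserved in this copy]

where the function pos(w) returns the part of speech of word w, and the function t(w, v) checks whether the two words w and v are translations of each other. Note that only some of the words in the sentence pair are considered when computing this feature, not all of them. The words left out are mostly function words, which do not necessarily have a counterpart in translation.

• bWER score
This score serves the same purpose as bPER, but it takes word-alignment information into account. We introduce it because most current SMT models depend on word alignment. As with bPER, we name it bWER to show its analogy with the word error rate (WER) mentioned in [7]. Equations 3.6 to 3.8 describe its calculation. Before this feature is computed, the sentence pair must first be word-aligned with a word alignment model such as IBM Model 3.

    (3.6) (3.7) (3.8)  [equations not preserved in this copy]

In theory, because this feature takes word alignment into account, it can avoid misjudgments and is therefore more precise than bPER. In practice, word alignment models are themselves not very accurate, so the scores this feature provides are actually less reliable. For this reason its weight in our ranking model is set relatively low.

Once the features are fixed, the remaining task is to determine their weights. The weights could be learned from a training set, if one were available; for simplicity, we set them by hand, as shown in Table 3.1.

    Weight      Feature                 Value
    \lambda_1   Language-model score    0.2
    \lambda_2   Length-match score      0.1
    \lambda_3   bPER score              0.5
    \lambda_4   bWER score              0.2

    Table 3.1  Parameter values used

4. Experiment and results

Evaluating a ranking model is difficult, because we have no already-ranked corpus to serve as a test set, and building one is not easy. To check whether our ranking model is effective, we designed an indirect experiment, which also illustrates one application of the model.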
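Before turning to the experiment, the scoring scheme of Section 3 can be made concrete with a minimal Python sketch. It assumes that Equation 2.1 is a weighted sum of feature scores and takes the weights from Table 3.1. Everything else is hypothetical: the names `length_match`, `bper`, `score`, and `rank` are ours, and since Equations 3.2 to 3.5 are not available here, the two feature functions are simple stand-ins (a length-ratio window bounded by a1 and a2, and a two-way dictionary coverage with a skip list standing in for the part-of-speech filter), not the authors' definitions. The language-model and bWER features are left out because they need an external language model and word aligner.

```python
from functools import partial
from typing import Callable, Dict, List, Mapping, Sequence, Set, Tuple

Sentence = Sequence[str]                       # a tokenized sentence
Pair = Tuple[Sentence, Sentence]               # (source f, target e)
Feature = Callable[[Sentence, Sentence], float]

# Weights from Table 3.1.
WEIGHTS: Dict[str, float] = {"lm": 0.2, "length": 0.1, "bper": 0.5, "bwer": 0.2}


def length_match(f: Sentence, e: Sentence, a1: float = 0.5, a2: float = 1.2) -> float:
    """ASSUMPTION: 1.0 if len(f) / len(e) lies in [a1, a2], else 0.0.

    Only the values a1 = 0.5 and a2 = 1.2 come from the paper; how they
    enter Equation 3.2 is a guess.
    """
    if not f or not e:
        return 0.0
    return 1.0 if a1 <= len(f) / len(e) <= a2 else 0.0


def bper(f: Sentence, e: Sentence, lexicon: Mapping[str, Set[str]],
         skip: frozenset = frozenset()) -> float:
    """ASSUMPTION: share of considered words, on both sides, that have a
    dictionary translation on the other side. `skip` stands in for the
    paper's part-of-speech filter on function words."""
    f_words = [w for w in f if w not in skip]
    e_words = [v for v in e if v not in skip]
    total = len(f_words) + len(e_words)
    if total == 0:
        return 0.0
    e_set = set(e_words)
    covered_f = sum(1 for w in f_words if lexicon.get(w, set()) & e_set)
    translations = set().union(*(lexicon.get(w, set()) for w in f_words))
    covered_e = sum(1 for v in e_words if v in translations)
    return (covered_f + covered_e) / total


def score(f: Sentence, e: Sentence, features: Mapping[str, Feature],
          weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted sum of the supplied feature scores (assumed form of Eq. 2.1)."""
    return sum(weights[name] * feature(f, e) for name, feature in features.items())


def rank(pairs: Sequence[Pair], features: Mapping[str, Feature],
         weights: Mapping[str, float] = WEIGHTS) -> List[Pair]:
    """Sort sentence pairs so that the highest-scoring pairs come first."""
    return sorted(pairs, key=lambda p: score(p[0], p[1], features, weights),
                  reverse=True)


# Toy data: one well-matched pair and one mismatched pair.
lexicon = {"我": {"i"}, "喜欢": {"like"}, "猫": {"cat", "cats"}}
features = {"length": length_match, "bper": partial(bper, lexicon=lexicon)}
good = (["我", "喜欢", "猫"], ["i", "like", "cats"])
bad = (["我", "喜欢", "猫"],
       ["the", "weather", "is", "nice", "today", "in", "the", "city"])
ranked = rank([bad, good], features)
print(ranked[0] is good)
```

Passing the features as a mapping keeps the combination step independent of how each feature is computed, so a learned weight vector or additional features could be swapped in without touching `score` or `rank`.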

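The experiment proceeded as follows.

First, we applied the ranking model to a Chinese-English parallel corpus of 80,000 sentence pairs and obtained a ranked parallel corpus.

Second, we built five subsets of this corpus, each containing 40,000 sentence pairs, as described in Table 4.1.

    Subset        Description
    G             Sentence pairs taken from the ranked corpus from the front backwards
    B             Sentence pairs taken from the ranked corpus from the back forwards
    R1, R2, R3    Sentence pairs drawn at random from the ranked corpus

    Table 4.1  Description of the five subsets

Third, we trained our SMT engine on each of the five subsets, obtaining five different SMT systems.

Finally, we ran the five systems on the 2004 863 machine translation evaluation data and scored their output with the BLEU and NIST metrics.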
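The subset construction of Table 4.1 can be sketched as follows. This is an illustration only: the function name `make_subsets`, the fixed random seed, and sampling without replacement for R1 to R3 are our assumptions, since the paper does not say how the random subsets were drawn.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def make_subsets(ranked: Sequence[T], size: int,
                 n_random: int = 3, seed: int = 0) -> Dict[str, List[T]]:
    """Build G (best `size` items), B (worst `size` items) and random
    subsets R1..Rn from a corpus that is already sorted best-first."""
    if not 0 < size <= len(ranked):
        raise ValueError("size must be between 1 and len(ranked)")
    rng = random.Random(seed)                 # fixed seed for repeatability
    items = list(ranked)
    subsets = {"G": items[:size], "B": items[-size:]}
    for k in range(1, n_random + 1):
        # ASSUMPTION: each random subset is sampled without replacement.
        subsets[f"R{k}"] = rng.sample(items, size)
    return subsets


# Toy corpus of 10 "sentence pairs", already ranked best-first.
subsets = make_subsets(list(range(10)), size=5)
print(subsets["G"], subsets["B"])
```

With an 80,000-pair corpus and `size=40000`, G and B do not overlap, while each random subset shares pairs with both.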

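The evaluation results are shown in Table 4.2.

            G         B         R1        R2        R3
    BLEU    0.0574    0.0527    0.0553    0.0554    0.0540
    NIST    3.9851    3.6435    3.8755    3.8832    3.8545

    Table 4.2  Machine translation evaluation results

Because all five systems were trained on little data (only 40,000 sentence pairs each), the scores in Table 4.2 are low. The results nevertheless show clearly that the system trained on G outperforms all the other systems, and the system trained on B performs worse than all the others. This indicates two things: first, the proposed ranking model is effective, in that the sentence pairs in G are better overall; second, a parallel corpus that contains more errors harms the translation performance of an SMT system.

5. Conclusion

This paper proposed a ranking model that sorts the sentence pairs in a parallel corpus.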

The model helps with filtering parallel corpora. It considers several factors, including a language model, length information, and meaning correspondence. Since current SMT systems depend on word alignment, word-alignment information is also taken into account. The experiment and its results show that the ranking model is effective.

The model is still preliminary: the factors it considers are not yet comprehensive, and the weights are simply set by experience. As a next step, we will refine the ranking model further and consider more factors, such as a measure of part-of-speech balance. We will also consider using bootstrapping to train the feature weights.

References

[1] Philip Resnik and Noah A. Smith, "The Web as a Parallel Corpus" [J], Computational Linguistics, Vol. 29, No. 3, pp. 349-380, Sep. 2003.
[2] Dragos S. Munteanu, Alexander Fraser, and Daniel Marcu, "Improved machine translation performance via parallel sentence extraction from comparable corpora" [A], Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2004), Boston, MA, pp. 265-272, May 2004.
[3] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, "The Mathematics of Statistical Machine Translation: Parameter Estimation" [J], Computational Linguistics, Vol. 19, No. 2, pp. 263-311, 1993.
[4] William A. Gale and Kenneth W. Church, "A Program for Aligning Sentences in Bilingual Corpora" [J], Computational Linguistics, Vol. 19, No. 1, pp. 75-102, Mar. 1993.
[5] Christoph Tillmann, Stephan Vogel, Hermann Ney, Alex Zubiaga, and Hassan Sawaf, "Accelerated DP based Search for Statistical Translation" [A], Proceedings of the European Conference on Speech Communication and Technology, Rhodes, Greece, pp. 2667-2670, Sep. 1997.
[6] Sonja Nießen, Franz J. Och, G. Leusch, and Hermann Ney, "An Evaluation Tool for Machine Translation: Fast Evaluation for Machine Translation Research" [A], Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece, pp. 39-45, May 2000.
