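Abstract
Collocations are an important resource for second language acquisition and natural language processing, yet extracting them automatically from corpora has long been a difficult problem. Corpus-based collocation extraction usually relies on statistical methods. In practice, however, linguists often pick an extraction algorithm arbitrarily, without considering how the register and the size of the corpus affect extraction efficiency, and this is often a major cause of poor extraction results. This thesis compares the extraction efficiency of four automatic collocation extraction methods: mutual information, the chi-square test, the t-test, and the log-likelihood ratio. It addresses two research questions: (1) For corpora of the same size but different registers, which extraction algorithm is the most efficient? (2) For corpora of the same register but different sizes, which extraction algorithm is the most efficient? The results show that: (1) when corpus size is held constant at 2 million words, the mutual-information-based algorithm is the most efficient for academic texts and news reports, whereas the log-likelihood-ratio-based algorithm is the most efficient for fiction; (2) when the register is held constant (news texts), the log-likelihood-ratio-based algorithm is the most efficient for corpora of up to 1 million words, whereas the mutual-information-based algorithm is the most efficient once the corpus exceeds 1 million words.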
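The abstract names the four association measures but does not give their formulas. As a rough illustration only, the sketch below scores a single bigram with common textbook formulations of pointwise mutual information, the t-score, Pearson's chi-square, and the log-likelihood ratio, all computed from a 2x2 contingency table. The class `BigramCounts`, the function names, and the sample counts are assumptions introduced here and are not taken from the thesis, whose actual implementation may differ (for example in tokenization, lemmatization, or window size).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BigramCounts:
    """Hypothetical container for the counts needed to score one bigram."""
    c12: int  # frequency of the bigram (w1 w2); assumed > 0
    c1: int   # frequency of w1; assumed 0 < c1 < n
    c2: int   # frequency of w2; assumed 0 < c2 < n
    n: int    # total number of bigrams in the corpus


def contingency(b: BigramCounts) -> tuple[int, int, int, int]:
    """Observed 2x2 table: (w1 w2), (w1 not-w2), (not-w1 w2), (not-w1 not-w2)."""
    o11 = b.c12
    o12 = b.c1 - b.c12
    o21 = b.c2 - b.c12
    o22 = b.n - b.c1 - b.c2 + b.c12
    return o11, o12, o21, o22


def pmi(b: BigramCounts) -> float:
    """Pointwise mutual information in bits: log2(P(w1 w2) / (P(w1) P(w2)))."""
    return math.log2(b.c12 * b.n / (b.c1 * b.c2))


def t_score(b: BigramCounts) -> float:
    """t-score, approximating the sample variance by the observed proportion."""
    expected = b.c1 * b.c2 / b.n
    return (b.c12 - expected) / math.sqrt(b.c12)


def chi_square(b: BigramCounts) -> float:
    """Pearson's chi-square statistic for a 2x2 table (no continuity correction)."""
    o11, o12, o21, o22 = contingency(b)
    numerator = b.n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


def log_likelihood(b: BigramCounts) -> float:
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)), skipping empty cells."""
    observed = contingency(b)
    rows = (b.c1, b.n - b.c1)
    cols = (b.c2, b.n - b.c2)
    # Expected counts under independence, in the same cell order as `observed`.
    expected = [r * c / b.n for r in rows for c in cols]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    sample = BigramCounts(c12=30, c1=200, c2=150, n=1_000_000)
    for measure in (pmi, t_score, chi_square, log_likelihood):
        print(f"{measure.__name__}: {measure(sample):.3f}")
```

In a full extraction pipeline, every candidate bigram would typically be scored with one of these measures and the candidates ranked by score. Comparing the top-ranked candidates against a reference list of collocations then yields precision and recall for each measure.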
Table of Contents
Abstract (in Chinese)
ABSTRACT
List of tables
List of figures
Chapter 1 Introduction
1.1 Research background
1.2 Significance of the research
1.3 The purpose of the study
1.4 The layout of the thesis
Chapter 2 Literature review
2.1 A short history of the study of collocation
2.1.1 Collocation studies in the 1950s
2.1.2 Collocation studies in the 1960s
2.1.3 Collocation studies in the 1970s
2.1.4 Collocation studies from 1980s till now
2.2 The notion of collocation
2.2.1 Definition of collocation by different scholars
2.2.2 The working definition of collocation in this study
2.3 Characteristics and properties of collocation
2.3.1 Typical features of collocation
2.3.2 Properties and functions of collocation
2.3.3 Linguistic subclasses of collocation
2.4 Previous Studies on collocation Extraction
2.4.1 Corpus-based approaches to collocation extraction
2.4.2 Algorithms used for collocation extraction
2.4.2.1 Mutual Information
2.4.2.2 Hypothesis Testing
Chapter 3 Research Methodology
3.1 The Corpora used in this study
3.1.1 An introduction to BNC corpus
3.1.2 The sub-corpora of BNC
3.1.2.1 The sub-corpora of different registers
3.1.2.2 The sub-corpora of different sizes
3.2 Tools for data collection
3.3 Procedures for data collection and data analysis
3.3.1 Tokenization
3.3.2 Lemmatization
3.3.3 Extraction of Bigrams and collocations
3.3.4 Evaluation of the extraction efficiency
Chapter 4 Data Analysis and Discussion
4.1 Extraction results from corpus of different registers
4.1.1 Comparison of the precision values
4.1.2 Comparison of the recall value
4.1.3 Detailed comparison of the extraction efficiency for corpus of different registers
4.2 Extraction results from corpus of different sizes
4.2.1 Comparison of the precision values
4.2.2 Comparison of the recall values
4.2.3 Detailed comparison of the extraction efficiency for corpus of different sizes
4.3 Discussion
Chapter 5 Conclusion
5.1 Major Findings
5.2 Limitation and suggestions for further study
Reference
Appendix
Acknowledgement
Resume
Tags: collocation extraction; algorithms; corpora