A Comparative Study of Automatic Collocation Extraction Methods

Abstract

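Collocations are an important resource for second language acquisition and natural language processing, yet extracting them automatically from corpora remains a difficult problem. Corpus-based collocation extraction usually relies on statistical methods. In practice, however, linguists often pick an extraction algorithm arbitrarily, without considering how the register and size of the corpus affect extraction efficiency, and this is a major cause of poor extraction results. This study compares the extraction efficiency of four automatic collocation extraction methods: mutual information, the chi-square test, the t-test, and the log-likelihood ratio. It addresses two questions: (1) For corpora of the same size but different registers, which extraction algorithm is the most efficient? (2) For corpora of the same register but different sizes, which extraction algorithm is the most efficient? The results show that: (1) when each corpus contains 2 million words, the algorithm based on mutual information is the most efficient for academic texts and news reports, whereas the algorithm based on the log-likelihood ratio is the most efficient for fiction; (2) within a single register (news text), the algorithm based on the log-likelihood ratio is the most efficient when the corpus contains up to 1 million words, and the algorithm based on mutual information is the most efficient once the corpus exceeds 1 million words.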
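
To make the four measures named in the abstract concrete, the sketch below scores a single bigram from raw counts using common textbook formulations of mutual information, the t-test, the chi-square test, and the log-likelihood ratio. It is only an illustration under stated assumptions: the function name `association_scores`, its parameters, and the example counts are hypothetical and are not taken from the thesis, whose exact formulas and tooling may differ.

```python
import math


def association_scores(c12, c1, c2, n):
    """Score one bigram (w1, w2) with four association measures.

    c12: count of the bigram w1 w2
    c1:  count of w1 as first word
    c2:  count of w2 as second word
    n:   total number of bigrams in the corpus
    (Hypothetical interface, chosen for illustration only.)
    """
    if c12 <= 0:
        raise ValueError("the bigram must occur at least once")

    # Observed 2x2 contingency table.
    o11 = c12                  # w1 followed by w2
    o12 = c1 - c12             # w1 followed by another word
    o21 = c2 - c12             # another word followed by w2
    o22 = n - c1 - c2 + c12    # neither w1 first nor w2 second
    observed = [o11, o12, o21, o22]

    # Row and column totals, then expected counts under independence.
    r1, r2 = c1, n - c1
    k1, k2 = c2, n - c2
    expected = [r1 * k1 / n, r1 * k2 / n, r2 * k1 / n, r2 * k2 / n]
    e11 = expected[0]

    # Pointwise mutual information, in bits.
    pmi = math.log2(o11 / e11)
    # t-score, using the usual approximation variance ~ observed count.
    t_score = (o11 - e11) / math.sqrt(o11)
    # Pearson chi-square statistic for a 2x2 table.
    chi2 = n * (o11 * o22 - o12 * o21) ** 2 / (r1 * r2 * k1 * k2)
    # Log-likelihood ratio (G^2); cells with zero count contribute 0.
    llr = 2 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected) if o > 0)

    return {"pmi": pmi, "t": t_score, "chi2": chi2, "llr": llr}


# Illustrative counts only (not data from the thesis).
scores = association_scores(c12=8, c1=15828, c2=4675, n=14307668)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Ranking all candidate bigrams by one of these scores and keeping the top of the list is one common way to turn a measure into an extraction method. The measures weight frequency differently (mutual information, for instance, tends to favour rare pairs), which is one reason the best choice can vary with corpus size and register.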

Table of contents

  • Abstract (Chinese)
  • ABSTRACT
  • List of tables
  • List of figures
  • Chapter 1 Introduction
  • 1.1 Research background
  • 1.2 Significance of the research
  • 1.3 The purpose of the study
  • 1.4 The layout of the thesis
  • Chapter 2 Literature review
  • 2.1 A short history of the study of collocation
  • 2.1.1 Collocation studies in the 1950s
  • 2.1.2 Collocation studies in the 1960s
  • 2.1.3 Collocation studies in the 1970s
  • 2.1.4 Collocation studies from 1980s till now
  • 2.2 The notion of collocation
  • 2.2.1 Definition of collocation by different scholars
  • 2.2.2 The working definition of collocation in this study
  • 2.3 Characteristics and properties of collocation
  • 2.3.1 Typical features of collocation
  • 2.3.2 Properties and functions of collocation
  • 2.3.3 Linguistic subclasses of collocation
  • 2.4 Previous Studies on collocation Extraction
  • 2.4.1 Corpus-based approaches to collocation extraction
  • 2.4.2 Algorithms used for collocation extraction
  • 2.4.2.1 Mutual Information
  • 2.4.2.2 Hypothesis Testing
  • Chapter 3 Research Methodology
  • 3.1 The Corpora used in this study
  • 3.1.1 An introduction to BNC corpus
  • 3.1.2 The sub-corpora of BNC
  • 3.1.2.1 The sub-corpora of different registers
  • 3.1.2.2 The sub-corpora of different sizes
  • 3.2 Tools for data collection
  • 3.3 Procedures for data collection and data analysis
  • 3.3.1 Tokenization
  • 3.3.2 Lemmatization
  • 3.3.3 Extraction of Bigrams and collocations
  • 3.3.4 Evaluation of the extraction efficiency
  • Chapter 4 Data Analysis and Discussion
  • 4.1 Extraction results from corpus of different registers
  • 4.1.1 Comparison of the precision values
  • 4.1.2 Comparison of the recall value
  • 4.1.3 Detailed comparison of the extraction efficiency for corpus of different registers
  • 4.2 Extraction results from corpus of different sizes
  • 4.2.1 Comparison of the precision values
  • 4.2.2 Comparison of the recall values
  • 4.2.3 Detailed comparison of the extraction efficiency for corpus of different sizes
  • 4.3 Discussion
  • Chapter 5 Conclusion
  • 5.1 Major Findings
  • 5.2 Limitation and suggestions for further study
  • Reference
  • Appendix
  • Acknowledgement
  • Resume
