Improving Statistical Bayesian Spam Filtering Algorithms

Abstract

The aim of this thesis is to improve the accuracy of Bayesian spam filtering, the most popular and widely used approach to spam filtering. Of the various possible ways to pursue this aim, the thesis presents two approaches that improved filtering performance. Three popular evolutions of the Bayesian spam filtering algorithm are reviewed: Naive Bayes, Paul Graham's, and Gary Robinson's. The proposed algorithms are built on top of those evolutions and incorporate novel ideas.

The first approach is co-weighting of multiple probability estimations. Although all are based on Bayes' theorem, several ways of computing probability estimations have been proposed and used. Those estimations are examined, and a new, more effective combined estimation based on co-weighted multi-estimations is proposed. The approach is compared with the individual estimations.

The second approach is based on co-weighted multi-area information. Bayesian spam filters, in general, compute probability estimations for tokens either without considering the email areas in which they occur, other than the body, or by treating the same token occurring in different areas as different tokens. In reality, however, occurrences of the same token in different areas are inter-related, and that relation could also play a role in classification. This novel idea is incorporated by correlating multi-area information through co-weighting, which yields more effective combined probability estimations for tokens. It is shown that this approach also improves spam filtering performance. The new approach is compared with individual area-wise estimations and with traditional separate estimations in all areas.
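The abstract does not give the thesis's formulas, so the following is only a minimal sketch of what "co-weighting" could look like, assuming it is a normalized weighted average. The two per-token estimates follow the commonly published forms of Paul Graham's and Gary Robinson's formulas. The function names, the weights, and the per-area values are hypothetical and chosen purely for illustration; they are not taken from the thesis.

```python
from typing import Sequence


def graham_estimate(spam_count: int, ham_count: int,
                    n_spam: int, n_ham: int) -> float:
    """Graham-style token spamminess.

    Ham counts are doubled to bias against false positives, and the
    result is clamped to [0.01, 0.99]. Unseen tokens get 0.4.
    """
    b = min(1.0, spam_count / n_spam)
    g = min(1.0, 2.0 * ham_count / n_ham)
    if b + g == 0.0:
        return 0.4
    return min(0.99, max(0.01, b / (b + g)))


def robinson_estimate(p: float, n: int, s: float = 1.0, x: float = 0.5) -> float:
    """Robinson-style smoothing of a raw estimate p seen in n emails.

    x is the assumed prior for unknown tokens and s is its strength.
    """
    return (s * x + n * p) / (s + n)


def co_weight(estimates: Sequence[float], weights: Sequence[float]) -> float:
    """Normalized weighted average (ASSUMED form of co-weighting)."""
    if not estimates or len(estimates) != len(weights):
        raise ValueError("estimates and weights must be non-empty and equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * e for w, e in zip(weights, estimates)) / total


# Multi-estimation: combine two estimates of the same token.
p_graham = graham_estimate(spam_count=30, ham_count=5, n_spam=100, n_ham=100)
p_robinson = robinson_estimate(p_graham, n=35)
p_token = co_weight([p_graham, p_robinson], weights=[0.5, 0.5])

# Multi-area: combine estimates of one token from different email areas
# (hypothetical values for the subject line and the body).
p_areas = co_weight([0.95, 0.60], weights=[2.0, 1.0])
```

The same helper serves both approaches here only because the sketch assumes a weighted average in each case; the thesis may weight or combine the estimates differently.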

Table of Contents

  • Table of Contents
  • List of Tables
  • List of Figures
  • Abstract
  • Acknowledgements
  • 1 Introduction
  • 1.1 Spam and its Types
  • 1.2 Anti-spamming Techniques
  • 1.3 Previous Works on Bayesian Spam Filtering
  • 1.4 Contributions
  • 1.5 Thesis Organization
  • 2 Statistical Bayesian Spam Filtering Algorithms
  • 2.1 Spam Filtering Steps
  • 2.2 Naive Bayes (NB) Algorithm
  • 2.3 Paul Graham's (PG) Algorithm
  • 2.4 Gary Robinson's (GR) Algorithm
  • 2.5 Dealing with Small Probabilities and Normalization
  • 3 Preprocessing and Feature Selection
  • 3.1 Preprocessing
  • 3.2 Feature Extraction or Tokenization
  • 4 Filtering Based on Co-weighted Multi-estimations
  • 4.1 Main Idea and Algorithm Description
  • 4.2 Training Algorithm
  • 4.3 Classification Algorithm
  • 5 Filtering Based on Co-weighted Multi-area Information
  • 5.1 Main Idea and Algorithm Description
  • 5.2 Training Algorithm
  • 5.3 Classification Algorithm
  • 6 Dataset Collections and Evaluation Measures
  • 6.1 Corpora Collections
  • 6.2 Evaluation Measures
  • 7 Experiments and Analysis
  • 7.1 Parameters Tuning
  • 7.2 Experiments with Co-weighted Multi-estimations
  • 7.2.1 Experiments and Results
  • 7.2.2 Analysis
  • 7.3 Experiments with Co-weighted Multi-area Information
  • 7.3.1 Experiments and Results
  • 7.3.2 Analysis
  • 8 Conclusions and Future Work
  • 8.1 Conclusions
  • 8.2 Future Work
  • Appendix
  • A Implementation of Filter Application
  • A.1 Data Structures
  • A.2 Source Files
  • A.3 Data Files
  • B Application User's Manual
  • B.1 System Requirements
  • B.2 Installation of the Application
  • B.3 Running and Using the Application
  • B.3.1 Dataset Preparer
  • B.3.2 Trainer
  • B.3.3 Classifier
  • B.3.4 Tester
  • C Program Documentation
  • C.1 Package and Class Summaries
  • C.1.1 Class Summary
  • C.1.2 Enum Summary
  • C.2 Hierarchy For Package rsspambayes
  • C.2.1 Class Hierarchy
  • C.2.2 Enum Hierarchy
  • C.3 Class Details
  • C.3.1 Algorithm
  • C.3.2 Category
  • C.3.3 Classifier
  • C.3.4 Counts
  • C.3.5 DatasetPreparer
  • C.3.6 FreqTable
  • C.3.7 Frequencies
  • C.3.8 GRobinsonBayes
  • C.3.9 NaiveBayes
  • C.3.10 PGrahamBayes
  • C.3.11 RShresthaBayes1
  • C.3.12 RShresthaBayes2
  • C.3.13 Stats
  • C.3.14 Tester
  • C.3.15 Tokenizer
  • C.3.16 Trainer
  • C.3.17 Utils
  • C.4 Enum Details
  • C.4.1 Algorithms
  • C.4.2 Areas
  • C.4.3 EmailCats
  • C.4.4 Headers
  • C.4.5 HtmlTags
  • C.4.6 Method Detail for All Enum Types
  • D List of Papers Published
  • Bibliography