Improving Statistical Bayesian Spam Filtering Algorithms

Thesis Abstract

The aim of this thesis is to improve the accuracy of Bayesian spam filtering, the most popular and widely used approach to spam filtering. Among the various possible ways to pursue this aim, the thesis presents two approaches that improved filtering performance. Three popular evolutions of the Bayesian spam filtering algorithm are reviewed: Naive Bayes, Paul Graham's, and Gary Robinson's. The proposed algorithms build on those evolutions and incorporate new ideas.

The first approach is co-weighting of multiple probability estimations. Although all are based on Bayes' theorem, several ways of computing probability estimations have been proposed and used. Those estimations are examined, and a new, combined, more effective estimation based on co-weighted multi-estimations is proposed. The approach is compared with the individual estimations.

The second approach is based on co-weighted multi-area information. Bayesian spam filters, in general, compute probability estimations for tokens either without considering the email areas in which they occur (other than the body) or by treating the same token occurring in different areas as different tokens. In reality, however, occurrences of the same token in different areas are inter-related, and that relation could also play a role in classification. This idea is incorporated by co-relating multi-area information through co-weighting, which yields more effective combined probability estimations for tokens. It is shown that this approach also improves the performance of spam filtering. The new approach is compared with individual area-wise estimations and with traditional separate estimations in all areas.
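The co-weighting idea summarized in the abstract can be illustrated with a small sketch. It is not the thesis's algorithm: the abstract does not give the combination rule, so a plain weighted average is assumed here. The two per-token estimates follow the commonly published forms of Paul Graham's and Gary Robinson's formulas. The function names, the weights, the area names ("subject", "body"), and all counts are hypothetical values chosen only for illustration.

```python
def graham_estimate(spam_count, ham_count, n_spam, n_ham):
    """Graham-style token spamminess: ham counts doubled, result clamped."""
    b = min(1.0, spam_count / n_spam)
    g = min(1.0, 2 * ham_count / n_ham)
    if b + g == 0:
        return 0.4  # Graham's default for tokens never seen in training
    return max(0.01, min(0.99, b / (b + g)))


def robinson_estimate(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style estimate: raw ratio shrunk toward the prior x with strength s."""
    n = spam_count + ham_count
    if n == 0:
        return x  # no evidence, fall back to the prior
    b = spam_count / n_spam
    g = ham_count / n_ham
    p = b / (b + g)
    return (s * x + n * p) / (s + n)


def co_weighted(estimates, weights):
    """Assumed combination rule: weighted average of several estimates."""
    if len(estimates) != len(weights):
        raise ValueError("estimates and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * e for w, e in zip(weights, estimates)) / total


# Multi-estimation: one token, two estimators, hypothetical equal weights.
estimates = [
    graham_estimate(30, 5, 100, 100),
    robinson_estimate(30, 5, 100, 100),
]
combined = co_weighted(estimates, weights=[0.5, 0.5])

# Multi-area: the same token estimated per email area (hypothetical values),
# merged into a single score instead of being treated as separate tokens.
area_estimates = {"subject": 0.9, "body": 0.6}
area_weights = {"subject": 0.3, "body": 0.7}
areas = list(area_estimates)
token_score = co_weighted(
    [area_estimates[a] for a in areas],
    [area_weights[a] for a in areas],
)

print(combined, token_score)
```

In a sketch like this the weights would be tuned on held-out mail rather than fixed by hand. The same `co_weighted` helper serves both approaches, because each one reduces several estimates for one token to a single probability.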

Thesis Contents

Table of Contents

List of Tables

List of Figures

Abstract

Acknowledgements

1 Introduction

1.1 Spam and its Types

1.2 Anti-spamming Techniques

1.3 Previous Works on Bayesian Spam Filtering

1.4 Contributions

1.5 Thesis Organization

2 Statistical Bayesian Spam Filtering Algorithms

2.1 Spam Filtering Steps

2.2 Naive Bayes (NB) Algorithm

2.3 Paul Graham's (PG) Algorithm

2.4 Gary Robinson's (GR) Algorithm

2.5 Dealing with Small Probabilities and Normalization

3 Preprocessing and Feature Selection

3.1 Preprocessing

3.2 Feature Extraction or Tokenization

4 Filtering Based on Co-weighted Multi-estimations

4.1 Main Idea and Algorithm Description

4.2 Training Algorithm

4.3 Classification Algorithm

5 Filtering Based on Co-weighted Multi-area Information

5.1 Main Idea and Algorithm Description

5.2 Training Algorithm

5.3 Classification Algorithm

6 Dataset Collections and Evaluation Measures

6.1 Corpora Collections

6.2 Evaluation Measures

7 Experiments and Analysis

7.1 Parameters Tuning

7.2 Experiments with Co-weighted Multi-estimations

7.2.1 Experiments and Results

7.2.2 Analysis

7.3 Experiments with Co-weighted Multi-area Information

7.3.1 Experiments and Results

7.3.2 Analysis

8 Conclusions and Future Work

8.1 Conclusions

8.2 Future Work

Appendix

A Implementation of Filter Application

A.1 Data Structures

A.2 Source Files

A.3 Data Files

B Application User's Manual

B.1 System Requirements

B.2 Installation of the Application

B.3 Running and Using the Application

B.3.1 Dataset Preparer

B.3.2 Trainer

B.3.3 Classifier

B.3.4 Tester

C Program Documentation

C.1 Package and Class Summaries

C.1.1 Class Summary

C.1.2 Enum Summary

C.2 Hierarchy For Package rsspambayes

C.2.1 Class Hierarchy

C.2.2 Enum Hierarchy

C.3 Class Details

C.3.1 Algorithm

C.3.2 Category

C.3.3 Classifier

C.3.4 Counts

C.3.5 DatasetPreparer

C.3.6 FreqTable

C.3.7 Frequencies

C.3.8 GRobinsonBayes

C.3.9 NaiveBayes

C.3.10 PGrahamBayes

C.3.11 RShresthaBayes1

C.3.12 RShresthaBayes2

C.3.13 Stats

C.3.14 Tester

C.3.15 Tokenizer

C.3.16 Trainer

C.3.17 Utils

C.4 Enum Details

C.4.1 Algorithms

C.4.2 Areas

C.4.3 EmailCats

C.4.4 Headers

C.4.5 HtmlTags

C.4.6 Method Detail for All Enum Types

D List of Papers Published

Bibliography
