Welcome to the CSDMC2010 SPAM corpus, which is one of the datasets for
the data mining competition associated with ICONIP 2010.
This dataset is composed of a selection of mail messages, suitable for
use in testing spam filtering systems.
------------------------------------------------------
Pertinent points
- All headers are reproduced in full. Some address obfuscation has taken
place: hostnames in some cases have been replaced with "csmining.org"
(which has a valid MX record), and most of the recipients have been
replaced with 'hibody.csming.org'. In most cases, though, the headers
appear as they were received.
- All of these messages were posted to public fora, were sent to me in the
knowledge that they may be made public, were sent by me, or originated as
newsletters from public mailing lists. Part of the data comes from other
public corpora; for certain reasons, information about those sources will
be released only after the competition.
- Copyright for the text in the messages remains with the original senders.
------------------------------------------------------
The corpus file -- CSDMC2010_SPAM.tar.bz2
On Linux platforms, it can be extracted with the command
tar -xjf CSDMC2010_SPAM.tar.bz2 -C email/
In an MS Windows environment, use the bzip2 software
http://gnuwin32.sourceforge.net/packages/bzip2.htm
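As a platform-independent alternative, the archive can also be unpacked with
Python's standard tarfile module. The following is only a minimal sketch for
Python 3: the function name extract_corpus is hypothetical, and the archive
name and the email/ target directory are taken from the tar command above.
Unlike tar -C, the sketch creates the target directory if it does not exist.

```python
import os
import tarfile


def extract_corpus(archive_path, dest_dir):
    """Extract a .tar.bz2 archive into dest_dir and list what landed there.

    Hypothetical helper, not part of the corpus package.
    """
    # Create the target directory first; tarfile will not do it for us
    # in every case, and "tar -C" would fail if it were missing.
    os.makedirs(dest_dir, exist_ok=True)
    # "r:bz2" opens the archive for reading with bzip2 decompression.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        archive.extractall(dest_dir)
    # Return the top-level entries so the caller can check the result.
    return sorted(os.listdir(dest_dir))
```

For example, extract_corpus("CSDMC2010_SPAM.tar.bz2", "email") would unpack
the corpus into email/ under the current directory.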
------------------------------------------------------
The corpus description
The dataset contains two parts:
- TRAINING: 4327 messages, of which 2949 are non-spam messages (HAM)
and 1378 are spam messages (SPAM), all received from non-spam-trap sources.
SPAMTrain.label contains the labels of the emails, where 1 stands for
HAM and 0 stands for SPAM.
- TESTING: 4292 messages without known class labels.
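The labels can be loaded into a dictionary and counted to check them against
the figures above. The sketch below rests on an assumption this README does
not state: that each non-empty line of SPAMTrain.label has the form
"<label> <filename>", for example "1 TRAIN_00000.eml". Please verify the
layout of the actual file before relying on it. The function names are
hypothetical.

```python
from collections import Counter


def load_labels(label_path):
    """Return {filename: label} read from a label file.

    ASSUMPTION: each non-empty line is "<label> <filename>" separated by
    whitespace. This layout is not documented in the README.
    """
    labels = {}
    with open(label_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank lines or lines in another layout
            label, filename = parts
            labels[filename] = int(label)
    return labels


def class_counts(labels):
    """Count messages per class: 1 stands for HAM, 0 stands for SPAM."""
    counts = Counter(labels.values())
    return {"ham": counts[1], "spam": counts[0]}
```

If the assumed layout holds, class_counts(load_labels("SPAMTrain.label"))
should report 2949 HAM and 1378 SPAM messages.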
------------------------------------------------------
The email format description
The format of the .eml files is defined in RFC822, and information on the more
recent email standard, i.e., MIME (Multipurpose Internet Mail Extensions), can be
found in RFC2045-2049.
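To see how these standards show up in practice, a message can be parsed with
the email package from the Python 3 standard library. The sketch below is an
illustration only; the function name describe_message is hypothetical. It
reports a few headers and the content type of every MIME part.

```python
from email import message_from_bytes


def describe_message(raw_bytes):
    """Parse raw message bytes and summarise headers and MIME structure."""
    message = message_from_bytes(raw_bytes)
    return {
        "subject": message.get("Subject"),
        "from": message.get("From"),
        "is_multipart": message.is_multipart(),
        # walk() yields the message itself and then every nested MIME part.
        "content_types": [part.get_content_type() for part in message.walk()],
    }
```

Reading a file in binary mode, e.g. open(path, "rb").read(), and passing the
result to describe_message avoids guessing the character encoding up front.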
------------------------------------------------------
On the provided Python script
Some data mining techniques make use of only the subject and body of the
email to identify spam. In this package, we have therefore included a simple Python
script (ExtractContent.py) which can help to extract the subject and body of the email.
In a Python-compatible environment (the code was tested on Python 2.5.1 and should
work on Python 2.x):
1, invoke the script with the command
./ExtractContent.py
2, input the source directory -- where you store the source files
For example
C:\EMAILPro\CSDMC2010_SPAM\TEST
3, input the destination directory -- where you want the extracted body to be written
For example
C:\EMAILPro\CSDMC2010_SPAM\TEST_NEW
4, we are done
Note that the script extracts only limited information from the email (no
information from fields such as To, From, or attachments is extracted, only the subject
and the first part of the body). By offering such a script, we just want to show
a simple preprocessing method that participants can start from.
More advanced methods which make use of email header information or even attachment
information are encouraged.
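For readers on Python 3, the same kind of preprocessing (subject plus first
text part of the body) might look like the sketch below. This is not the
code of ExtractContent.py, which targets Python 2.x; it is an independent
illustration built on the standard email package, and the function name
extract_subject_and_body is hypothetical.

```python
from email import message_from_bytes
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def extract_subject_and_body(raw_bytes):
    """Return (subject, body) where body is the first text part found."""
    message = message_from_bytes(raw_bytes)

    # Decode RFC 2047 encoded words in the subject; fall back to the raw
    # header when the encoding is broken or names an unknown charset.
    raw_subject = message.get("Subject", "")
    try:
        subject = str(make_header(decode_header(raw_subject)))
    except (LookupError, UnicodeError, HeaderParseError):
        subject = str(raw_subject)

    # Visit the MIME parts in order and stop at the first text/* part.
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            body = payload.decode(charset, errors="replace")
        except LookupError:
            # The declared charset is unknown to Python.
            body = payload.decode("utf-8", errors="replace")
        return subject, body
    return subject, ""
```

Decoding with errors="replace" is a deliberate choice here: spam messages
often declare a charset that does not match their bytes, and a replacement
character is usually preferable to aborting the whole preprocessing run.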
------------------------------------------------------
Please direct any questions regarding this dataset to <bantao>at<nict>dot<go>dot<jp>.