The following is the data validation report that the seller chose to provide:
Data Description
Overview
There are three items here: the `spam` folder contains the original spam emails; the `ham` folder contains the original non-spam emails; `spam_ham_data` is the CSV file I obtained after processing the two raw folders, and it can be used directly for further feature engineering and model training. I have also attached the code that processes these email files; please have a look at `code`.
Please check my notebook, which shows how to convert the raw SpamAssassin files to a CSV file.
I hope it helps you understand how the CSV file was created.
Welcome
Welcome to the home page for the open-source Apache SpamAssassin Project.
Apache SpamAssassin is the #1 Open Source anti-spam platform giving system administrators a filter to classify email and block spam (unsolicited bulk email).
You can click https://spamassassin.apache.org/old/publiccorpus/ to check the original email data.
> I use `2003_easy_ham`, `2003_hard_ham`, and `2003_spam`. (I merged `2003_easy_ham` and `2003_hard_ham` into a single folder, `ham`.)
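The merge step described above can be sketched as follows. This is a minimal sketch, not the code shipped with the dataset: `merge_folders` is a hypothetical helper, and the folder paths are assumed to be wherever the archives were extracted. Prefixing each file with its source folder name is an added precaution so that two files with the same name cannot overwrite each other.

```python
import shutil
from pathlib import Path


def merge_folders(sources, target):
    """Copy every regular file from each source folder into target.

    Hypothetical helper: each copied file is renamed to
    "<source folder>__<file name>" to avoid name collisions.
    Returns the number of files copied.
    """
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = 0
    for source in map(Path, sources):
        for path in sorted(source.iterdir()):
            if path.is_file():
                shutil.copy2(path, target / f"{source.name}__{path.name}")
                copied += 1
    return copied
```

For example, `merge_folders(["2003_easy_ham", "2003_hard_ham"], "ham")` would fill a `ham` folder, assuming the two source folders exist in the working directory.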
Features or Columns
- Email: the original data read from the original files. Use it to generate more features!
- Label: `0` means `ham`, `1` means `spam`.
- Subject: the subject of an email.
- Content: the main body of an email.
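To illustrate how the four columns above could be derived from a raw message, here is a rough sketch using only the Python standard library. It is not the notebook's actual code: `parse_email` and `build_csv` are hypothetical names, decoding the raw bytes as Latin-1 is an assumption made so that no byte is lost, and the body is taken from the first plain-text part (falling back to HTML) rather than by any rule the notebook is known to use.

```python
import csv
from email import message_from_bytes, policy
from pathlib import Path

FIELDS = ["Email", "Label", "Subject", "Content"]


def parse_email(raw_bytes, label):
    """Turn one raw message into a row with the four dataset columns."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the plain-text body, fall back to HTML if there is none.
    body = msg.get_body(preferencelist=("plain", "html"))
    try:
        content = body.get_content() if body is not None else ""
    except (LookupError, UnicodeDecodeError, ValueError):
        content = ""  # unknown charset or undecodable payload
    return {
        "Email": raw_bytes.decode("latin-1"),  # assumption: lossless decode
        "Label": label,
        "Subject": str(msg["Subject"] or ""),
        "Content": content,
    }


def build_csv(folder_labels, out_path):
    """Write one CSV row per file; folder_labels maps folder -> 0 or 1."""
    with open(out_path, "w", newline="", encoding="utf-8",
              errors="replace") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for folder, label in folder_labels.items():
            for path in sorted(Path(folder).iterdir()):
                if path.is_file():
                    writer.writerow(parse_email(path.read_bytes(), label))
```

A call such as `build_csv({"ham": 0, "spam": 1}, "spam_ham_rebuilt.csv")` would produce a file with the same column layout; the output file name here is only an example.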
Modeling
As mentioned above, you can use the CSV file directly for modeling. The attached code demonstrates several machine-learning modeling workflows and already achieves relatively good results. I haven't done any fine-tuning yet, so you can almost certainly improve on my baseline cross-validation scores.
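As a rough idea of what a cross-validated baseline on this CSV might look like, here is a small standard-library sketch. It is not one of the models from the attached code: it swaps in a hand-written multinomial Naive Bayes with add-one smoothing and simple interleaved folds, and the helper names (`load_rows`, `train_nb`, `predict_nb`, `cross_val_accuracy`) are hypothetical. Only the column names `Label`, `Subject`, and `Content` come from the dataset description.

```python
import csv
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN.findall(text.lower())


def load_rows(path):
    """Read the CSV and return (texts, labels) using Subject + Content."""
    csv.field_size_limit(10**9)  # email fields can exceed the default limit
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    texts = [f"{r['Subject']} {r['Content']}" for r in rows]
    labels = [int(r["Label"]) for r in rows]
    return texts, labels


def train_nb(texts, labels):
    """Count tokens per class (0 = ham, 1 = spam)."""
    word_counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    totals = {c: sum(word_counts[c].values()) for c in (0, 1)}
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, totals, Counter(labels), vocab


def predict_nb(model, text):
    """Return the class with the higher smoothed log-probability."""
    word_counts, totals, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for c in (0, 1):
        score = math.log((class_counts[c] + 1) / (n_docs + 2))
        denom = totals[c] + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # ignore tokens never seen in training
                score += math.log((word_counts[c][token] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)


def cross_val_accuracy(texts, labels, k=5):
    """Per-fold accuracy; row i goes to fold i % k (needs len(texts) >= k)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(texts)) if i % k != fold]
        test = [i for i in range(len(texts)) if i % k == fold]
        model = train_nb([texts[i] for i in train], [labels[i] for i in train])
        hits = sum(predict_nb(model, texts[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return scores
```

Typical use would be `texts, labels = load_rows(path_to_csv)` followed by `cross_val_accuracy(texts, labels)`, where `path_to_csv` points at the `spam_ham_data` CSV file. The folds are not stratified, so accuracy on a class-imbalanced dataset should be read with some caution.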
