Of Spam and Source Code

Sunday, 2006-10-15; 16:08:00

More on the amount of spam e-mail, and source code for a spam e-mail counter

So about a month ago, I posted about the amount of spam I have been getting over the past year. I also mentioned a utility called Sp@mX (again, forgive the annoying spelling: "He spelled it h-e-n-three-r-y. The 'three' is silent, you see.") by Hendrickson Software Components that seemed to be successful at cutting spam off at the source, rather than sorting out the spam at its destination.

I started reporting spam again on September 10 using Sp@mX. How's the experiment going? Here's an update with some newer pretty graphs showing the results.

The following graph shows the actual number of spam e-mails per day. The purple spike towards the right end of the graph is just a visual marker showing when I started reporting spam. (It's purple because I couldn't, for the life of me, figure out how to change the color in Keynote 2.0. If I remember correctly from the iWork '06 trial, this might be a feature limited to Keynote 3.0.)

Total Spam Per Day Updated

Since the previous graph isn't terribly useful for gauging overall trends because of the wild fluctuations in the amount of spam, this next graph is a 10-day running average of the number of spam e-mails I've received.

10-Day Running Average of Spam Received

And finally, here's a 20-day running average.

20-Day Running Average of Spam Received

You can see from the last two graphs that the amount of spam I received in October is dramatically reduced from the level in September, in both my Stanford and my .mac accounts. The peak of spam was 101 spam e-mails on August 20. In terms of 10-day averages, the current level is 33 compared to 90, so spam has been cut to a little over a third. In terms of 20-day averages, the current level is 41 compared to 81, about half of peak levels.

The very curious thing, however, is that spam levels started falling even before I started reporting the spam. There's a gap of 3 weeks between the peak spam level and the day I started reporting spam on the absolute graph. So while spam has been dramatically reduced, I can't necessarily give any credit to Sp@mX.

I'd bet that the spam reporting is still having an effect, but it's hard to see on these graphs. The spam levels in August and September may have been anomalously high, and they may already have been returning to the normal levels of June and July before I started reporting spam. It's also possible that spam would have returned to its high levels if I hadn't been reporting it, but the reporting kept the downward trend going.

It's impossible to judge exactly what is going on without a control account, which, unfortunately, I don't have. It's possible that I could report spam only from my Stanford account and leave the .mac spam alone, since that's where the majority of my spam comes from anyway. But I have a strong incentive to continue reporting spam from both accounts, especially because even 5 spam e-mails per day is quite annoying, IMHO.

More updates as this experiment continues. But so far, I am happy with the downward trend. Hopefully it keeps up. :)

You may be wondering how I make these graphs, especially since manually computing the running averages is not exactly trivial (in terms of time), and one would have to be pretty on top of things for the calculations not to get too overwhelming.

Funny anecdote, at least to programmers: initially, I had the program count spam e-mails by incrementing counters inside a series of dictionaries. Then, I would run through the dictionary and put the results into an ordered array that was much easier to iterate over when creating the final output. However, I started by going through the spam count dict first, tracking when I skipped over days and adding buffer zeros for these cases. That turned out to be very complicated, because I had to keep track of whether or not I skipped days at the end or beginning of months, compensate for the differing numbers of days in months, and handle leap years as well.

I realized that a much better way would be to simply iterate through days from the start date to the end date, looking up spam count values as I went along. So instead of iterating through the dates that had spam count values, I'd instead check for spam count values for desired dates, and add buffer zeros when there was no spam count available. This allowed me to let an NSCalendarDate object deal with the number of days in a given month and all the problems associated with the former method. That portion of the code was reduced from 166 lines of code to 30 lines with the change in design.
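To make the idea concrete, here is a minimal sketch in Python rather than the project's actual Cocoa code. It assumes the counts are already stored in a dictionary keyed by date; the function name `daily_counts` and the sample numbers are made up for illustration. The date library plays the role that NSCalendarDate plays above: it handles month lengths and leap years, so the loop only has to step forward one day at a time.

```python
from datetime import date, timedelta

def daily_counts(counts, start, end):
    """Return (day, count) pairs for every day from start to end, inclusive.

    Days missing from `counts` get a buffer zero.
    """
    series = []
    day = start
    while day <= end:
        # Look up the desired day instead of walking the stored dates.
        series.append((day, counts.get(day, 0)))
        # timedelta handles month boundaries and leap years.
        day += timedelta(days=1)
    return series

# Hypothetical sample data: two days with spam, two skipped days between them.
sample_counts = {date(2006, 2, 27): 4, date(2006, 3, 2): 7}
for day, n in daily_counts(sample_counts, date(2006, 2, 27), date(2006, 3, 2)):
    print(f"{day}\t{n}")
```

The skipped days at the end of February come out as zeros without any special-case code for the month boundary.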

Luckily, computers are perfectly suited for these kinds of things. I mentioned that I wrote a program to count the e-mails, and I'm releasing the source code for it. It's a small project, nothing too complicated, and pretty specific, but if you're interested in counting your spam e-mails (or just the dates of e-mails in general), here's your chance. :)

So without further ado, I present Spam E-Mail Counter 0.1, released under the MIT license.

Here's how to use it.

1. Build the project using Xcode.
2. In Mac OS X Mail, select the e-mails you want to count and select "Save As..." from the File menu.
3. Choose "Raw Message Source" from the popup menu of the save panel, a location for the file, and press "Save". (If you are exporting a lot of e-mails, you might want to open Mail's activity viewer to see the progress of the save operation.)
4. Open Spam E-Mail Counter, and choose start and end counting dates.
5. Click "Count Spam E-mails", find the file from Mail in the open panel, and press "Open".
6. Copy the data from the Console application or the built-in Xcode console. It's preformatted with tabs so you can just copy-paste into something like Keynote.
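For the curious, the counting step itself can be sketched roughly like this in Python. This is not the code in the project, just an illustration under some assumptions: that the exported file is mbox-style (each message starts with a line beginning with "From "), and that each message carries a standard `Date:` header. The names `split_messages` and `count_by_day` are made up for this example.

```python
from collections import Counter
from email import message_from_string
from email.utils import parsedate_to_datetime

def split_messages(raw_text):
    """Split mbox-style text into one string per message (assumed format)."""
    messages, current = [], []
    for line in raw_text.splitlines(keepends=True):
        if line.startswith("From "):
            # A "From " line marks the start of the next message.
            if current:
                messages.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        messages.append("".join(current))
    return messages

def count_by_day(raw_text):
    """Count messages per calendar day, using each message's Date header."""
    counts = Counter()
    for source in split_messages(raw_text):
        header = message_from_string(source)["Date"]
        if header is None:
            continue  # no Date header, skip the message
        try:
            # The day is taken in the time zone given in the header itself.
            counts[parsedate_to_datetime(header).date()] += 1
        except (TypeError, ValueError):
            continue  # unparseable date, skip the message
    return counts

# Hypothetical two-message export.
sample = (
    "From spammer@example.com Sun Oct 15 16:08:00 2006\n"
    "Date: Sun, 15 Oct 2006 16:08:00 -0700\n"
    "Subject: Buy now\n"
    "\n"
    "Body text\n"
    "From other@example.com Mon Oct 16 09:00:00 2006\n"
    "Date: Mon, 16 Oct 2006 09:00:00 -0700\n"
    "Subject: Hello\n"
    "\n"
    "More body text\n"
)
for day, n in sorted(count_by_day(sample).items()):
    print(f"{day}\t{n}")
```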

It's possible that you can use other e-mail clients, if they can export the source of a number of raw e-mails into a single file, provided that dates are in the correct format. Obviously the interface also needs some work, but it was sufficient for my needs. You can easily modify the source code to spit out running averages of whatever period in days you want, too.
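As an illustration of the running-average part, here is a small Python sketch. It assumes a trailing window (each value is the average of that day and the days before it) and only produces a value once a full window is available; the project's own code may handle the edges differently. The function name `running_average` and the sample numbers are hypothetical.

```python
def running_average(values, window):
    """Trailing running average over the last `window` values.

    Returns one average per position that has a full window behind it.
    """
    averages = []
    total = 0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            # Drop the value that just slid out of the window.
            total -= values[i - window]
        if i >= window - 1:
            averages.append(total / window)
    return averages

# Hypothetical daily spam counts, averaged over 3 days.
daily = [10, 20, 30, 40, 50]
print(running_average(daily, 3))
```

Changing the second argument to 10 or 20 gives the two averaging periods used in the graphs above.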
