I wrote a little e-mail harvest program to give me a list of all e-mail addresses in use on our web site. The results are written to two text files: one contains only our e-mail addresses, and the other contains all non-SNC e-mail addresses. I run this script every night. Feel free to download and use it as you see fit.

Possible Uses:

We use the results for several things:

Download/Requirements:

This is a Perl script. I've renamed it with a ".txt" extension for easy downloading, so you'll need to change the name back to "harvestemails.pl" and make it executable.

harvestemails.txt

The script should run just fine on any Linux system (perhaps Windows too). I'm using Perl v5.8.0, but the code should also work on some older versions. The only module required is File::Find.

To run it for the first time, you'll want to customize these two variables:

$datafile_ours = path to the data file, your college e-mail addresses
$datafile_oth = path to the data file, external (non-college) e-mail addresses

Testing:

Before running the script, be sure to read through the code, especially all the comments. You'll find some tips that will be helpful in testing. I strongly recommend that you run this script on a smallish directory first, until you get familiar with what to expect. After a few test runs, try it on your entire web site.

The Method:

This script searches a directory structure for files that end with ".htm" or ".html". You can customize this to your liking. It's important to note that because the script reads files from disk rather than fetching them over HTTP, it may find orphan files (files that exist on your web site but are not linked to).
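To make the directory walk concrete, here is a minimal sketch of the idea using File::Find. It is not the code from harvestemails.pl; the function name find_html_files and the case-sensitive pattern are my own assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Return every plain file under $root whose name ends in .htm or .html.
# (Hypothetical helper; the real script may organize this differently.)
sub find_html_files {
    my ($root) = @_;
    my @found;
    find(
        sub {
            # $_ is the bare file name; $File::Find::name is the full path.
            push @found, $File::Find::name if -f $_ && /\.html?\z/;
        },
        $root
    );
    my @sorted = sort @found;
    return @sorted;
}

# Only walk the tree when a directory is given on the command line.
if (@ARGV) {
    print "$_\n" for find_html_files($ARGV[0]);
}
```

Run it against a small directory first, for example `perl findhtml.pl /path/to/test/dir` (the file name is made up). Because it walks the disk, it lists every matching file, linked or not.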

The Results:

The results are written to two simple text files: one contains only your school's e-mail addresses, and the other contains all other e-mail addresses.
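Splitting addresses into "ours" and "others" comes down to a domain check. The sketch below shows one way it could be done; $our_domain, is_ours, and split_by_domain are hypothetical names, not variables or functions from the actual script.

```perl
use strict;
use warnings;

# Hypothetical setting: the domain that counts as "ours".
my $our_domain = 'snc.edu';

# True when $addr ends in @$domain (case-insensitive, exact domain only).
sub is_ours {
    my ($addr, $domain) = @_;
    return $addr =~ /\@\Q$domain\E\z/i ? 1 : 0;
}

# Split a list of addresses into (ours, others) array references.
sub split_by_domain {
    my ($domain, @addrs) = @_;
    my (@ours, @others);
    for my $addr (@addrs) {
        if (is_ours($addr, $domain)) { push @ours,   $addr }
        else                         { push @others, $addr }
    }
    return (\@ours, \@others);
}

my ($ours, $others) =
    split_by_domain($our_domain, 'email1@snc.edu', 'someone@example.com');
print "ours: @$ours\n";
print "others: @$others\n";
```

Note that this version treats only the exact domain as ours, so an address at a subdomain would land in the "others" file. Adjust the pattern if your school uses subdomains.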

Here's an example of the output file format:

[email1@snc.edu]
/dir/index.html=26
/dir/sub/project.html=21,52

[email2@snc.edu]
/dir2/goodstuff/page.html=20,20

As you can tell, each e-mail address is listed once inside square brackets, followed by a list of web pages where that e-mail address is found. The line numbers on which it is found are also included. Note that if a standard e-mail link is used, like this:

<a href="mailto:email2@snc.edu">email2@snc.edu</a>

then the e-mail address actually exists twice on that line, so it will be reported as such in the results.
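Here is a small sketch of why a mailto link produces the same line number twice. The address pattern is deliberately simple and is my assumption, as are the helper names scan_lines and format_record; the real script may match addresses differently.

```perl
use strict;
use warnings;

# Deliberately simple address pattern (an assumption, not the script's own).
my $email_re = qr/[A-Za-z0-9._%+-]+\@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

# Scan the lines of one page; return { address => [line numbers] }.
# Every match is recorded, so two matches on one line give two entries.
sub scan_lines {
    my (@lines) = @_;
    my %hits;
    my $lineno = 0;
    for my $line (@lines) {
        $lineno++;
        while ($line =~ /($email_re)/g) {
            push @{ $hits{ lc $1 } }, $lineno;
        }
    }
    return \%hits;
}

# Format one address record in the data-file layout shown above.
sub format_record {
    my ($addr, $pages) = @_;    # $pages: { path => [line numbers] }
    my $out = "[$addr]\n";
    for my $path (sort keys %$pages) {
        $out .= $path . '=' . join(',', @{ $pages->{$path} }) . "\n";
    }
    return $out;
}

my @page = (
    '<p>Contact:</p>',
    '<a href="mailto:email2@snc.edu">email2@snc.edu</a>',
);
my $hits = scan_lines(@page);
print format_record('email2@snc.edu',
    { '/dir2/goodstuff/page.html' => $hits->{'email2@snc.edu'} });
```

The link sits on line 2 of the sample page and the address occurs in both the href and the link text, so the record printed is "/dir2/goodstuff/page.html=2,2".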

Viewing the Results:

This is a separate CGI script that shows this data to campus folks in our portal. I've cleaned it up so it's pretty generic looking. Of course you can pretty it up however you see fit. I've renamed it with a ".txt" extension for easy downloading, so you'll need to change the name back to "emails.cgi". You'll probably need to put this in a directory on your web server where CGI scripts are allowed to run (e.g. cgi-bin), and also set the proper execute permissions.

emails.txt

To run it for the first time, you'll want to customize these three variables:

$collegeabbrev = abbreviation for your college name (e.g. SNC)
$datafile_ours = path to the data file, your college e-mail addresses
$datafile_oth = path to the data file, external (non-college) e-mail addresses

Of course the two datafile paths must match the paths you used in the 'harvestemails' script.
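If you'd rather write your own viewer, the data files are easy to read back in. The following is a rough sketch of a parser for the format shown earlier; parse_records is a hypothetical helper and is not taken from emails.cgi.

```perl
use strict;
use warnings;

# Parse data-file text into { address => { path => [line numbers] } }.
sub parse_records {
    my ($text) = @_;
    my (%data, $current);
    for my $line (split /\n/, $text) {
        $line =~ s/\s+\z//;                 # drop trailing whitespace / CR
        next if $line eq '';
        if ($line =~ /^\[(.+)\]\z/) {       # "[address]" starts a new record
            $current = $1;
            $data{$current} ||= {};
        }
        elsif (defined $current && $line =~ /^(.+)=([\d,]+)\z/) {
            $data{$current}{$1} = [ split /,/, $2 ];   # "path=n,n,..."
        }
    }
    return \%data;
}

my $sample = <<'END';
[email1@snc.edu]
/dir/index.html=26
/dir/sub/project.html=21,52

[email2@snc.edu]
/dir2/goodstuff/page.html=20,20
END

my $data = parse_records($sample);
for my $addr (sort keys %$data) {
    my $pages = scalar keys %{ $data->{$addr} };
    print "$addr appears on $pages page(s)\n";
}
```

With the sample above, this reports two pages for email1@snc.edu and one page for email2@snc.edu. In a real viewer you would read the text from the files named in $datafile_ours and $datafile_oth instead of a here-document.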

Other Ideas:

I've got plans to write something that parses through the list of our own addresses and validates them against our directory via an LDAP lookup. If invalid e-mail addresses are found (e.g. an address is misspelled, or someone leaves the college), an e-mail will be sent to me. Running this nightly ensures that I'll be notified within 24 hours of any address going bad (so to speak). This will also catch those situations when someone puts an address like "william.smith@snc.edu" on the web when his actual e-mail address is "bill.smith@snc.edu".
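Since this part isn't written yet, the following is only a sketch of the shape it might take. The directory lookup is passed in as a code reference so the checking logic can be tried without an LDAP server; find_invalid and the stand-in %directory hash are hypothetical. A real version could back the lookup with a module such as Net::LDAP, but the server, search base, and attribute names are site-specific, so that wiring is left out.

```perl
use strict;
use warnings;

# Return the addresses for which the lookup reports no directory entry.
# $exists is a code reference: address in, true/false out.
sub find_invalid {
    my ($exists, @addrs) = @_;
    my @invalid = grep { !$exists->($_) } @addrs;
    return @invalid;
}

# Stand-in directory for demonstration; a real run would query LDAP instead.
my %directory = map { $_ => 1 } ('bill.smith@snc.edu');
my $lookup = sub { exists $directory{ lc $_[0] } };

my @bad = find_invalid($lookup, 'bill.smith@snc.edu', 'william.smith@snc.edu');
print "Invalid: $_\n" for @bad;
```

Here only "william.smith@snc.edu" is reported, which is exactly the misnamed-address case described above. The list of invalid addresses could then be mailed out by the nightly job.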

Feedback:

I'm always open to feedback, suggestions for improvement, etc. Contact me any time.

 

