I wrote this program to keep track of a web page and notify me if it has changed. Note that it tracks specific web pages, not an entire web site. You can configure as many URLs as you want, and it will track each one (and can notify different people for each), but it will not spider an entire site and tell you what's changed.
Possible Uses:
Download/Requirements:
This is a Perl script. I've renamed it with a ".txt" extension for easy downloading, so you'll need to change the name back to "urlnotify.pl" and make it executable.
The script should run just fine on any Linux system (perhaps Windows too). I'm using Perl v5.8.0, but the code contained should also work on some older versions. It requires the old "timelocal.pl" library.
You'll want to customize a few variables:
$diff = the path to your diff command
$wget = the path to your wget command
$sum = the path to your sum command
$tmpdir = the directory where this script stores its data
$cfgfile = the full path to the config data file
$from = the e-mail address that e-mails should come from (this variable is found in the 'notify' subroutine)
The config file:
The config data is a simple text file that you edit manually. It's formatted just like the old Windows INI files. Define the path to your config file using the $cfgfile variable. Here's a sample:
[001]
name=SNC Wikipedia article
url=http://en.wikipedia.org/wiki/St._Norbert_College
email=scott.crevier@snc.edu,webmaster@snc.edu
keephtml=0

[002]
name=City of De Pere
url=http://www.de-pere.org/
email=scott.crevier@snc.edu
keephtml=0
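As a rough illustration of how a file in this format can be read, here is a minimal sketch in Python (not the Perl used by urlnotify.pl). The function name load_pages and the returned dictionary layout are assumptions made for this example; only the section ids and the four keys come from the sample above.

```python
import configparser

SAMPLE = """\
[001]
name=SNC Wikipedia article
url=http://en.wikipedia.org/wiki/St._Norbert_College
email=scott.crevier@snc.edu,webmaster@snc.edu
keephtml=0

[002]
name=City of De Pere
url=http://www.de-pere.org/
email=scott.crevier@snc.edu
keephtml=0
"""


def load_pages(text):
    """Return {page_id: settings} for each [id] section (hypothetical helper)."""
    # interpolation=None so a literal "%" in a URL is not treated as a placeholder.
    parser = configparser.ConfigParser(interpolation=None)
    # With the default strict mode, a repeated [id] raises DuplicateSectionError.
    parser.read_string(text)
    pages = {}
    for page_id in parser.sections():
        section = parser[page_id]
        pages[page_id] = {
            "name": section["name"],
            "url": section["url"],
            # Comma-separated, no spaces, so a plain split is enough.
            "email": section["email"].split(","),
            "keephtml": section["keephtml"] == "1",
        }
    return pages


pages = load_pages(SAMPLE)
```

In this sketch the ids stay strings ("001", not 1), so any unique label would work as an id.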
Each web page is monitored individually. Each page in the config file above has a unique id number (inside square brackets). This really could be anything; I happen to use a three digit integer to keep it simple. Whatever you use, these must be unique. The values associated with each page are as follows:
| name | this is just a simple text name that is used in the e-mail notifications |
| url | this is the full URL to the page that is monitored |
| email | this is a comma-separated list of e-mail addresses that will be notified when this web page changes (no spaces) |
| keephtml | 1 = compare the page in its full unmodified format; 0 = first strip all HTML code, then compare |
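To make the keephtml setting concrete, here is a small Python sketch (not the script's own Perl code) of what "strip all HTML code, then compare" could look like. The names TextOnly and comparable_text are made up for this example, and skipping the contents of script and style elements is an assumption on my part, not something the script is known to do.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text content, dropping tags and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def comparable_text(html, keephtml):
    """Return the text that would be compared for a given keephtml value."""
    if keephtml:
        return html  # keephtml=1: compare the page exactly as downloaded
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Drop blank lines and surrounding whitespace left behind by removed tags.
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```

With keephtml=0, a page whose markup changes but whose visible text stays the same compares as unchanged, which is usually what you want for pages with rotating ads or session ids in their HTML.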
The Method:
The first time this script is run for any page, it simply downloads and saves a copy of the page (in the tmp directory). The page is named "url-xxx-new.html", where xxx is the unique id number.
The next time (and all subsequent times) it is run, it renames the previous version of the page to "url-xxx-old.html", downloads a new copy (url-xxx-new.html), and compares the two (using the "sum" command). If the checksum is different, it then uses the "diff" command to get a list of the differences. It then sends an e-mail to the configured addresses for that URL.
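The rename, download, compare cycle described above can be sketched as follows. This is an illustration in Python, not the script's Perl code, and it swaps in different tools: a SHA-256 digest stands in for the "sum" command, Python's difflib stands in for "diff" (so the output format differs), and a caller-supplied fetch function stands in for wget. The names check_page, digest and fetch are hypothetical.

```python
import difflib
import hashlib
import tempfile
from pathlib import Path


def digest(text):
    """Checksum used to decide cheaply whether two copies differ."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_page(page_id, fetch, tmpdir):
    """Run one check: None on first run, [] if unchanged, else diff lines."""
    new = Path(tmpdir) / f"url-{page_id}-new.html"
    old = Path(tmpdir) / f"url-{page_id}-old.html"
    first_run = not new.exists()
    if not first_run:
        new.replace(old)  # the previous download becomes the "old" copy
    new.write_text(fetch(), encoding="utf-8")
    if first_run:
        return None  # nothing to compare against yet
    old_text = old.read_text(encoding="utf-8")
    new_text = new.read_text(encoding="utf-8")
    if digest(old_text) == digest(new_text):
        return []  # checksums match, so no diff is needed
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0))


# Three runs against a fake page: first run, no change, then a change.
with tempfile.TemporaryDirectory() as tmpdir:
    first = check_page("001", lambda: "a\nb\n", tmpdir)
    second = check_page("001", lambda: "a\nb\n", tmpdir)
    third = check_page("001", lambda: "a\nc\n", tmpdir)
```

The checksum step is only an optimization: the diff alone would also reveal whether anything changed, but comparing checksums first avoids the more expensive comparison on the common "no change" run.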
Note #1: The code to actually send the e-mail is not included in this script. Instead, the script just prints the e-mail message to STDOUT. You will need to insert your own code there (see the "sendEmail" subroutine). I use a custom Perl library for that; some folks use sendmail. If you're already using Perl to send e-mails in other scripts, you should be able to "borrow" that code.
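For readers who want to see the shape of such a hook, here is a minimal Python sketch (the script itself is Perl, so treat this as an illustration only). It builds a message and, like the unmodified script, prints it instead of sending it. The function names, the subject wording and the example sender address are all assumptions made for this example.

```python
from email.message import EmailMessage


def build_notification(name, url, recipients, diff_text, sender):
    """Assemble the notification for one changed page (hypothetical layout)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"Page changed: {name}"
    msg.set_content(f"{url}\n\n{diff_text}\n")
    return msg


def send_email(msg):
    """Placeholder delivery: print the message instead of sending it."""
    for header in ("From", "To", "Subject"):
        print(f"{header}: {msg[header]}")
    print()
    print(msg.get_content())
    # To really send, one option (assuming a mail server on localhost) is:
    #   import smtplib
    #   with smtplib.SMTP("localhost") as server:
    #       server.send_message(msg)
```

A call such as send_email(build_notification("City of De Pere", "http://www.de-pere.org/", ["scott.crevier@snc.edu"], "2c2", "urlnotify@example.com")) would print the headers followed by the body, which makes it easy to test before wiring in real delivery.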
Note #2: If you're familiar with the "diff" command, you'll know that its output may not be the most friendly thing to read. This script does some minor format changes in order to make the e-mail message look as simple as possible.
Note #3: The "diff" command uses the greater than (>) and less than (<) symbols in its output. This looks nice on a console display, but in an e-mail message, the greater than symbol is used to indicate quoted text from a previous message. So this script converts those to different characters (» and «).
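The marker conversion in Note #3 might look something like this Python sketch (again, not the script's Perl code). The function name is made up, and replacing only the marker at the start of each line, so that > or < inside the page text is left alone, is my assumption about the intended behavior.

```python
def soften_diff_markers(diff_output):
    """Replace leading diff markers: '>' becomes '»' and '<' becomes '«'."""
    converted = []
    for line in diff_output.splitlines():
        if line.startswith(">"):
            line = "»" + line[1:]
        elif line.startswith("<"):
            line = "«" + line[1:]
        converted.append(line)
    return "\n".join(converted)


sample = "2c2\n< if a > b then old\n---\n> if a > b then new"
converted_sample = soften_diff_markers(sample)
```

Here the "2c2" and "---" lines pass through untouched, and the "a > b" inside each changed line keeps its original symbol.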
Note #4: Of course you'll need to schedule this script to run regularly. I recommend daily or weekly. Or if it's one of your own pages, and you need to know sooner, perhaps hourly would work.
The Results:
The results are e-mailed. If no changes are found, no e-mail is sent. Either way, info is always written to the log file so you can see what it's doing. Here are some sample log file entries:
2007-07-26 15:01:47 SNC Wikipedia article first run, no check performed
2007-07-26 15:01:47 City of De Pere first run, no check performed
2007-07-26 15:01:54 SNC Wikipedia article no change since Thu 26-Jul-2007 3:01pm
2007-07-26 15:01:54 City of De Pere no change since Thu 26-Jul-2007 3:01pm
2007-07-26 15:02:32 SNC Wikipedia article changed since Thu 26-Jul-2007 3:01pm; notification sent to scott.crevier@snc.edu,webmaster@snc.edu
2007-07-26 15:02:32 City of De Pere changed since Thu 26-Jul-2007 3:01pm; notification sent to scott.crevier@snc.edu
2007-07-26 15:17:07 SNC Wikipedia article no change since Thu 26-Jul-2007 3:02pm
2007-07-26 15:17:07 City of De Pere no change since Thu 26-Jul-2007 3:02pm
Testing:
I recommend that you test this BEFORE you insert your e-mail code (see note #1 above). This way, you can run it as many times as you need, and it will only display its results (and write to the log file). Then, once it's working, go ahead and insert the code so that it sends e-mails.
Feedback:
I'm always open to feedback, suggestions for improvement, etc. Contact me any time.